DOCUMENT RESUME

ED 048 330                                                  TM 000 382

AUTHOR          Humphreys, Lloyd G.; And Others
TITLE           Project on Techniques of Objective Factor Analysis.
INSTITUTION     Illinois Univ., Champaign.
SPONS AGENCY    Office of Naval Research, Washington, D.C.
                Psychological Sciences Div.
PUB DATE        Apr 70
NOTE            117p.

EDRS PRICE      EDRS Price MF-$0.65 HC-$6.56
DESCRIPTORS     Ability, Correlation, *Factor Analysis, *Factor
                Structure, *Group Intelligence Tests, Group Tests,
                Homogeneous Grouping, Individual Tests,
                *Intelligence, *Personnel, Predictive Validity,
                *Psychological Tests, Psychometrics, Test
                Construction, Test Reliability, Test Validity

ABSTRACT 

This collection of papers, concerned with the nature
and theory of intelligence, forms part of a project to integrate test
and factor theory with the empirical, functional relationships
involving standard intelligence tests. The project will render more
objective the use of factor analysis in personnel research. A
definition of intelligence encompassing biological and
socio-psychological factors is formulated in "Theory of
Intelligence." Three classes of hypothesis are presented in
"Hypothesis Developed From the Theory." In "The Psychological Test" a
psychological theory is delineated as a basis for developing a theory
of intelligence congruent with the experimental and observational
correlates of measures of intelligence. Interrelations of
homogeneity, reliability, and validity are considered in the paper of
this name. "Illustrations of Test Characteristics by Means of
Physical Analogues," "Evaluating the Importance of Factors in any
Given Order of Factoring," and a description of "The Scottish Survey
of Intelligence" complete the collection. (Author/CK)

The several separate manuscripts accompanying this
preface were destined to be chapters in a monograph
on the nature of intelligence. The plan was to
integrate both test and factor theory with the
empirical, functional relationships involving
standard tests of intelligence. The completion of
this work has been delayed by a sudden shift in
career of the author.

This work was supported in part by the Personnel
and Training Research Programs, Psychological
Sciences Division, Office of Naval Research, under
Contract N00014-67-A-0305-0012 and in part by a
sabbatical year supported by the University of
Illinois. It is being forwarded now as a technical
report under the contract in the hope that, even in
its incomplete state, the material contained therein
will be of use to personnel research activities.

Reproduction in whole or in part is permitted for
any purpose of the United States Government.

Lloyd G. Humphreys

National Science Foundation

The initial thrust of the work under this contract was toward the development
of empirically tested procedures designed to make more objective the use of factor
analysis in personnel research. Later work has moved into more substantive
applications. Ability theory, for example, has been closely tied to factor
analytic techniques from the earliest work of Spearman in this field. What
factor analysis can and cannot do, and the confidence that one can place in the
results, are tied intimately to modern ability theory.

THEORY OF INTELLIGENCE



It is necessary, as a first step, to formulate a definition of intelligence.
The usual criterion for a definition is, of course, that the term in question in
conjunction with other terms in the theory lead to testable hypotheses. The
definition must lead to scientifically useful consequences. It is also reasonable
to employ a secondary criterion on occasion. Since intelligence tests are in
common use, and since these tests have become firmly entrenched in this society,
the definition of intelligence should be tied directly to available measuring
devices. This second criterion is compatible with a philosophy of science that
does not dictate an operational definition for every concept in the theory, but
it is more convenient to have operational definitions for certain terms in the
theory than for others.

Definition of Intelligence. Intelligence is defined as the entire repertoire
of acquired skills, knowledge, learning sets, and generalization tendencies
considered intellectual in nature that are available at any one period of time. An
intelligence test contains items that sample the totality of such acquisitions.
Intelligence so defined is not an entity such as Spearman's "mental energy."
Instead the definition suggests the Thomson "multiple bonds" approach. Nevertheless,
for the sake of convenience intelligence will be discussed as if it were a unitary
disposition to solve intellectual problems.

There is one important difference from Thomson's multiple bonds, at least as
the Thomson theory has at times been interpreted, that should be clarified. It
is not essential that the person whose intelligence is measured have acquired a
specific response to each stimulus or set of stimuli presented. Learning sets
and generalization tendencies were introduced in the definition to preclude
critical interpretations of this type.

The definition of intelligence here proposed would be circular as a function
of the use of "intellectual" if it were not for the fact that there is a consensus
among psychologists as to the kinds of behaviors that are labelled intellectual.
Thus the Stanford-Binet and the Wechsler tests can be considered examples of this
consensus and define the consensus. It is also true that a present consensus does
not rigidly define intellectual for all time to come. One should expect change
to occur. This change will come slowly, however, because the process of changing
the definition of a test in terms of the items composing it is a slow one. As the
empirical basis for change, primary reliance must be placed on functional
relationships involving the total score on the test.

Contrast with Older Operationalism. This definition differs from the statement
that intelligence is what intelligence tests measure. When the intercorrelations
of several different intelligence tests do not approximate unity closely after
correction for attenuation, the strict operationalist is left with as many different
definitions of intelligence as there are tests. From the present point of view,
however, one would not expect different tests to be perfectly correlated since each
samples a domain that is fairly heterogeneous with a limited number of items.
Parallel forms of the same test should be more highly correlated than different
intelligence tests since in the former there is no item sampling error and there
is near identity of parallel items.
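
The correction for attenuation mentioned in this paragraph can be illustrated
with a minimal sketch. It assumes the classical Spearman formula, in which the
observed correlation between two tests is divided by the square root of the
product of their reliabilities. The function name and the numerical values are
hypothetical and are not taken from the report.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores on two tests.

    r_xy: observed correlation between test X and test Y
    r_xx, r_yy: reliability coefficients of X and Y
    (classical Spearman correction; an assumption, not the report's own code)
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical example: two intelligence tests correlate .72 and each has
# a reliability of .90.
corrected = correct_for_attenuation(0.72, r_xx=0.90, r_yy=0.90)
print(round(corrected, 2))
```

In this made-up case the corrected value remains well below unity, which is the
situation the paragraph describes: the shortfall is attributed to the sampling
of items from a heterogeneous domain rather than to unreliability alone.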

A problem arises in trying to set a desired height of intercorrelations of
tests sampling from the same domain. There is no easy answer. An a priori
approach is not possible since a great deal depends on the number of items in
each test and the degree of homogeneity of the domain. A combination of a
rational analysis of the content of the tests in question plus a distribution of
the intercorrelations of the proposed tests provides a partial answer. Tests
of satisfactory reliability but whose correlations with other intelligence tests
are not a part of the main distribution of such correlations can be considered
inadequate representatives of the domain. By this criterion a typical culture fair
test of intelligence is not an acceptable measure of intelligence at this point in
time.

A second difference between the two approaches to definition is that the
present one fits into a larger context. Knowledge of learning and of the
constitutional bases for learning become important. As a result the definition here
proposed leads to testable hypotheses concerning intelligence.

A third difference between the present definition and the older, more
superficial operationalism is that a distinction is made between the repertoire of
responses, which is intelligence as here defined, and the eliciting of those
responses on the test. A person whose repertoire of responses is for some reason
not available at the time the test is administered can still be intelligent. This
distinction is often phrased in the psychological literature as that between
learning and performance, but the emphasis here is between acquired knowledge and
skill, on the one hand, and performance on the other.

Discrepancies between intelligence and performance on an intelligence test
can conceivably arise in a very large number of ways. The test constructor and
the test administrator try to minimize the discrepancies by writing reliable,
unambiguous items, by standardizing the conditions of test administration, and by
specifying the populations of persons and the set of situations for which the
test is appropriate. How successful such efforts are is an empirical matter and
cannot be evaluated in the armchair. A useful generalization from a great deal
of such research is that intellectual performance is relatively robust. It is
not affected substantially by many of the a priori possibilities. This finding
should not, however, be taken as an excuse for careless or unsophisticated use of
intelligence tests.

Biological Substrate. Since most theorists have defined intelligence as a
capacity, generally fixed by inheritance, it is necessary to specify the reasons
why this seems undesirable. It should be clearly understood at the outset that
the present writer does not exclude the possibility, or rather probability, that
constitutional differences among men affect the ease with which intellectual
dispositions are acquired. He prefers the term "biological substrate" for
intelligence to cover these differences while intelligence is reserved for the
acquired disposition.

Biological differences can arise from many causes. In addition to genetically
determined differences, biological differences can be acquired prenatally,
perinatally, and postnatally. Furthermore, the genetically determined differences
are far from unitary. Instead the genes are responsible for a huge complex of
anatomical and biochemical factors. It is extremely doubtful that physiological
psychologists are going to find a single key to the differential facility the
human possesses in the acquisition of intellectual dispositions. Biological
substrate and genetic substrate, respectively, for intellectual performances are more
appropriate terms than a word which suggests an entity.

From the point of view of the user of an intelligence test the most important
reason for not defining intelligence in terms of a genetic substrate is that a
given person's standing with respect to genetic factors can not be inferred from a
test score. The test measures acquired behavior. Independent assessment of the
genetically determined biological base is presently possible for only a tiny
portion of the human population, e.g., phenylketonuria. Some few of the acquired
organic differences can be independently assessed, e.g., certain of the birth
"injuries." Experimental control is lacking in studies of human genetics so that
it is even impossible to draw conclusions about relative contribution to variance
of genetic factors in an analysis of variance design.

The construct of a genetic substrate for intelligence is required more by
general biological knowledge and belief in biological continuity from lower animals
to man than by good information concerning human genetics. Family relationship
and other experimentally uncontrolled studies of human genetics are suggestive but
not convincing. It is difficult to believe, however, that the controlled breeding
studies of behavioral traits in lower animals could not be duplicated with the
human if controls were possible. More basic to this line of reasoning is the
inference that any inter-species difference will also show intra-species
differences. There are clear-cut differences between man and other primates in the
genetic substrate for intelligence. It is reasonable to assume that individual
men will also differ in their genetic substrate for use of symbols, abstract
reasoning and problem solving, etc.

While a biological substrate for intelligence is made necessary by biological
knowledge, the construct can not at the present time enter into testable hypotheses
in any except the most general fashion. Any given organism may have innate
capacity for the development of his intelligence, but the limits of this are very
nebulous indeed. This capacity, furthermore, is not necessarily fixed at a given
level throughout the life span. There may be genetically determined differences
in the rate of maturation and of decline of the biological substrate that will
influence individual differences in intelligence. It is safe to conclude that no
amount of training will transform a chimpanzee into a human being intellectually,
or a Mongoloid into a genius, but present data do not allow much more specific
inferences than these.

Psycho-social Substrate. For basically the same reason that a test user can
not draw inferences concerning genetic causes from a test score, he can not draw
inferences concerning environmental causes from a test score. Each human being is
biologically unique. Two different biological organisms developing in seemingly
identical environments will acquire different intellectual repertoires. Identical
biological organisms developing in different environments will also acquire
different intellectual repertoires. It is also true that similar repertoires can result
from different mixes of heredity and environment. It is useful, therefore, to
define a concept parallel to the biological substrate: namely, the psycho-social
substrate. The psycho-social substrate for intelligence is just as important as
the biological substrate, but is almost equally difficult to assess independently.
Furthermore, the two are by no means orthogonal. Probable genetic differences
among social classes, for example, accompany psycho-social differences.

It was stated earlier with respect to the biological substrate that only the
most general sorts of inferences could be drawn legitimately. The same is true
concerning the psycho-social substrate. If a man were raised in isolation, his
intelligence would be very low. Quasi-experimental approaches to this condition
are furnished by canal boat and gypsy children (Anastasi, 1958). It is also
probable that one could increase the quality of the psycho-social substrate with
respect to developing intelligence and obtain an increase in intellectual level,
but relatively little is known experimentally about this matter. Again, a
quasi-experimental approach to this problem is furnished by the comparison of
intelligence of World War I and World War II draftees (Tuddenham, 1948) and of the World
War II and 1963 norms of the Air Force Classification tests (Tupes and Shaycoft,
1964). The results are quite dramatic. Between the two World Wars, the increase
amounted to approximately one standard deviation of the World War I distribution
while subsequent to World War II the increase appears to be about one-half of a
standard deviation.

In summary, response acquisition requires both a biological (including
genetic) substrate and a psycho-social substrate which interact throughout the
life span. Responses are acquired, and lost, during development, maturity, and
decay. The test user can not draw specific inferences from a subject's test
score about either of the two substrates.

Types of Behavioral Repertoires. A distinction is drawn traditionally between
intelligence and achievement tests. A naive statement of the difference is that
the intelligence test measures capacity to learn and the achievement test measures
what has been learned. But items in all psychological and educational tests
measure acquired behavior. The measures of even the simplest sensory and motor
functions require a background of learning in order for the examinee to understand
the directions and to provide answers.

A statement that recognizes the incongruity of a behavioral measure as a
measure of capacity is that intelligence tests contain items that all examinees
have had an equal opportunity to learn. This statement can be dismissed as false
on its face. The psycho-social substrate is simply not equal for all. Opportunity
depends on the characteristics of father and mother, siblings, other relatives,
friends, the neighborhood, the schools, and other environment. There is no merit
in maintaining a fiction. There is also no merit in belaboring this fiction as
an argument against the use of tests.

Intelligence is here defined as the totality of responses available to the
organism at any one period of time for the solution of intellectual problems.
Intellectual is defined by a consensus among psychologists. The intelligence
test samples the responses in the subject's repertoire at the time of testing.
So defined, there are no differences in kind between intelligence and achievement,
or between aptitude and achievement. There are instead three dimensions
appropriate to the description of tests and the repertoires they sample (Humphreys,
1962). There are quantitative differences among different types of tests on
these dimensions.



1. The most important of these dimensions is breadth. An intelligence test
is much broader in coverage than individual achievement tests. Concurrent
correlations between intelligence and achievement in a specific subject matter are
quite high, but far from perfect. When a number of achievement tests in different
subject matters are administered, thus achieving greater breadth on the
achievement side, the total score obtained from the test battery is very highly correlated
with measured intelligence. As a matter of fact, this correlation is about as
high as the intercorrelations among recognized tests of intelligence.

2. A second dimension of difference is the extent to which a test is defined
by a specific educational program. The achievement test is tied to a particular
academic curriculum while the intelligence test samples both learning in school
and out of school. An achievement test must be revised when the course of study
changes while an intelligence test is more independent of what is being taught in
a particular school at a particular period of time. The psycho-social substrate
for the achievement test is more narrowly defined.

3. A third dimension of difference is the recency of the learning sampled.
The achievement test measures recent learning primarily while the intelligence
test samples older learning. Thus 8th grade arithmetic is a part of the
"aptitude" section of the College Board tests and high school algebra is tapped by the
"aptitude" section of the Graduate Record Examination, but similar questions
administered in the 8th or 9th grade would be achievement items.
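
The point made under the first dimension, that a battery of achievement tests
correlates more highly with measured intelligence than any single achievement
test does, follows from the standard formula for the correlation of a sum. The
sketch below is illustrative only. It assumes k standardized tests that each
correlate r_criterion with the intelligence test and that intercorrelate
r_inter on the average; the function name and all numbers are hypothetical.

```python
import math


def composite_correlation(k, r_criterion, r_inter):
    """Correlation between an outside test and the unweighted sum of k tests.

    Assumes standardized component tests with a common correlation
    r_criterion with the outside test and a common intercorrelation r_inter.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Variance of the sum of k unit-variance tests with common correlation.
    composite_variance = k + k * (k - 1) * r_inter
    return k * r_criterion / math.sqrt(composite_variance)


# Hypothetical values: each achievement test correlates .60 with the
# intelligence test, and the achievement tests correlate .50 with each other.
for k in (1, 2, 4, 8):
    print(k, round(composite_correlation(k, 0.60, 0.50), 3))
```

Under these assumptions the correlation rises as tests are added and
approaches r_criterion divided by the square root of r_inter, so breadth alone
can move a battery toward the level of correlation found among intelligence
tests.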

The use of aptitude requires additional clarification. The term is used
commonly for one of the components of general intelligence as well as for an
ability not considered a component of intelligence. The former is the sense of
its use by the College Board and the Graduate Record Examination. Aptitude is
also used at times in a very general sense to include both intelligence and
non-intellectual abilities. No matter how used, however, there is no problem in
fitting aptitude into the present analysis of differences among test items and the
behavioral repertoires they sample. When used narrowly, aptitude and intelligence
tests differ on the first dimension, but not on the second and third. Both
aptitude and achievement tests would be classified as narrow, but an aptitude test in
contrast to an achievement test assesses older learning that is not restricted to
the classroom.

The dimensional analysis is useful in indicating why there is confusion
concerning the proper category in which to place certain tests. Just because
differences among test items are quantitative and not qualitative, it is possible for
one man's intelligence test to be another man's achievement test. Thus Jensen
(1968) categorizes the National Merit Scholarship Examination as an intelligence
test, but precisely the same items are used in the Iowa Tests of Educational
Development for assessing achievement. Frequently, the distinction between
achievement and intelligence (or aptitude) tests is stated in terms of the purpose
for which the test is used (Wesman, 1968). Purpose is independent of type of
item. A test used for the prediction of future performance is called an aptitude
test while the same test used to evaluate learning is called an achievement test.
Thus, there is no conflict between the present definition of intelligence and the
types of items used in measuring achievement and aptitude.

Contributions of Learning to Theory. Several different well established
principles of learning contribute to the theory of intelligence being developed.
The principles that are most useful are very broad and are also independent of the
nuances of various learning theories. They might be said to be within the public
domain of accepted psychological knowledge.

1. One of the most important principles of learning for the development of
intelligence is the presence of an intellectual psycho-social substrate. No one
can learn to use abstract words who has had no contact with language. In the
school the parallel principle is that of curriculum. A student will not acquire
mathematical knowledge and skills who has had no exposure to mathematics. Note,
furthermore, that it is exposure, not adequacy of exposure, that is the issue.
In experimental attacks on type of exposure, type makes little contribution to
variance. There are many cases also in which the exposure was highly idiosyncratic,
e.g., Abraham Lincoln studying by fire light.

2. There must be motivation or incentive to learn. Motivation may be positive
or negative, intrinsic or extrinsic, but must be present in some form. This
statement of principle is intended to avoid an issue important in the psychology of
learning. While reinforcement for some theorists is an essential part of the
mechanism of learning, for others reinforcement is necessary for performance but
not for learning per se. Nevertheless, all theorists acknowledge the importance
of motivation for increased effectiveness of performance. Latent or incidental
learning may exist, but it is very inefficient, and motivation is required for
performance.

Given the fact that children differ in the type and degree of motivation for
intellectual learning at a given moment in time, what is the source for these
differences? There are again biological and psycho-social substrates for
motivation as well as for intelligence. In this case the psycho-social substrate
includes both the reinforcement history and current situational factors. In the
absence of ability to manipulate the genetic substrate, for one who is interested
in changing the course of future learning the necessary procedure is to control
type of exposure and to reinforce the behavior desired.

3. Forgetting is very slow for well learned or overlearned behavior. Given
occasional rehearsal of learned behavior, practically no forgetting occurs. This
means with respect to the development of intelligence that the intellectual
repertoire continues to grow as long as the subject remains in an intellectual
environment. This environment does not need to be an academic environment since
an educated man cast away on an uninhabited island with a set of encyclopedias
could still remain in an intellectual environment. There will be so little loss,
in comparison with gain, for students during the school years that loss can be
disregarded. For purposes of assessing the gain a total score uncorrected for
differences in chronological age must be used; i.e., mental age units are
adequate, but intelligence quotient units are not. With respect to the latter a
person who does not show as much growth as his fellows will show a loss in I.Q.

4. Transfer of training takes place typically within a domain that the man
on the street would consider quite narrow. In general measured transfer turns
out to be less than nonpsychologists assume will be the case. For the development
of intelligence this means that a great many relatively specific learnings have
to take place. Primates can develop learning sets, but Harlow's monkeys learn
relatively narrow sets (Harlow, 1949), e.g., the odd stimulus among a set of
three. It takes each monkey a relatively large number of trials to acquire each
such set. While the human brings to the learning situation a different and more
efficient constitutional substrate for the acquisition of learning sets, or
concepts, than does the monkey, it is still necessary for the human to acquire a very
large number of these within the intellectual domain. (The number of these in the
human is indicated roughly by the size of his comprehensive vocabulary.) While
he does not have to acquire separately and individually each specific response
that psychologists would label intellectual, even the number of learning sets or
generalization tendencies is very large so that a great deal of time is required
for the learning.

5. Transfer is not only fairly narrow, but it can also be both positive and
negative. Proactive inhibition is just as important as proactive facilitation.
Or, to revert to terms that are more common in the literature of individual
differences, a person can as readily acquire a disability as an ability. Certain
disabilities are quite stable and quite resistant to change. Thus every person
acquires to a greater or less degree a disability to speak a foreign language
without accent. Few adults are able to overcome this disability. There are a
very large number of items in the intellectual repertoire and each of these has
both positive and negative effects on future response acquisition.
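
The remark under the third principle, that mental age units register gain
while intelligence quotient units can register a loss for the same person, can
be shown with simple arithmetic. The sketch assumes the traditional ratio I.Q.
(100 times mental age divided by chronological age), which the text does not
spell out; the function name and the ages are hypothetical.

```python
def ratio_iq(mental_age, chronological_age):
    """Traditional ratio I.Q.: 100 * MA / CA (an assumed convention)."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age / chronological_age


# Hypothetical child tested at ages 8 and 10.
ma_first, ca_first = 8.0, 8.0
ma_second, ca_second = 9.5, 10.0

gain_in_mental_age = ma_second - ma_first             # repertoire has grown
change_in_iq = ratio_iq(ma_second, ca_second) - ratio_iq(ma_first, ca_first)

print(gain_in_mental_age, change_in_iq)
```

The repertoire grows by a year and a half of mental age, yet the quotient
falls because the child's fellows grew by two years over the same interval.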

Contribution of Biology to Theory. Again, only the most general principles
will be described. Unfortunately, the number of principles and their specificity
in this area are not as directly pertinent to the development of intelligence as
are the principles of learning. This arises because of the difficulties attendant
upon doing controlled experimental work on the functioning of the human central
nervous system and upon human genetics.

1. The companion principle to the first learning principle is that the
subject must have a minimally adequate biological substrate. Persons showing the
lowest levels of intelligence typically have biologically inadequate organisms.
Children with phenylketonuria, Mongolism, cretinism, etc. will not be able to
acquire intellectual behavior at a normal rate. Their capacity to learn is not
well defined, and can be drastically underestimated, but capacity is none the less
limited by their biological limitations.

2. The important distinction between phenotype and genotype is meaningless
unless there is independent assessment of the genotype. A diagnosis of genetically
determined feeble-mindedness from a test score is not possible. The combination
of psycho-social and biological substrates leading to performance at the moron
level may differ widely from one person to another who test at that level. It is
useful at this point to repeat the injunction presented earlier: namely, it is
impossible to draw causal implications concerning any substrate from the test
score alone.

3. Each human being is biologically unique as a function of the number of
chromosomes and number of genes in the genetic substrate and the large number of
biological effects of events in the prenatal, perinatal, and postnatal
environments. It is not even necessary to exclude monozygotic twins in making this
statement, although the uniqueness of genotypes must be discarded for such twins.
In spite of the uniqueness of genotypes, it is also true that there is a
clustering of sorts among genotypes. This arises from the partial segregation of gene
pools in sub-populations of the human species.

4. The biological substrate for intelligence includes a very large number of
specific anatomical structures, physiological functions, and biochemical agents.
It is highly probable that there are genetically determined individual differences
in each of these and that these individual differences are for the most part
independent of each other. The characteristics of all synapses in a given organism
can probably not be determined from those of a particular synapse, or the
characteristics of one ganglion in a given organism are not those of all ganglia.
There are also possible a multitude of environmental effects on the biological
organism that start at the moment of conception and extend throughout the life
span.

Developmental Principles. There are at least two important principles for a
theory of intelligence that can not be clearly distinguished as either learning
principles or biological principles. Both maturation and learning are presumably
involved.

1. A person's present behavioral repertoire is an imperfect predictor of a
future repertoire. This principle has been well documented by Fleishman and
associates for motor learning (1954, 1955, 1960). Early trials are not correlated
nearly as highly with later trials as adjacent trials are to each other. For
the intellectual repertoire the principle has been substantiated by Anderson (1940)
and Roff (1941). These latter investigators found that gains in mental age from
year to year were independent of the base mental age at the start of the year.

There is ample a priori rationale for this principle. There is a great deal
of seeming randomness in anyone's environment which will affect the psycho-social
substrate and even at times the biological substrate for intelligence. The school
a child attends, the particular teacher to whom a child happens to be assigned,
the particular peer group he happens to become intimate with, the characteristics
of his parents and siblings, accidents producing nervous system injuries,
illnesses leaving neural defects, all of these impinge on the developing organism and
interact with his current status. Such influences, e.g., characteristics of
parents and sibs, are only partially correlated at best with the characteristics
of the child. This means that motivation to learn fluctuates somewhat
unpredictably and exposure to various kinds of learning is somewhat unpredictable.
Both lead to unpredictability of future learning and thus to an uncertain future
repertoire.

Biological development also does not proceed at the same rate for all
structures nor for all individuals. Those who arrive at sexual maturity early
tend to be taller than their age mates at that time, but achieve shorter adult
height. There is a possible genetic basis for differential growth rates that
would account for reduced correlations between present status and future
development. Thus it is not possible to rule out unevenness in biological development
as at least a partial cause of the findings of Anderson and Roff. There is a
seeming randomness in both the biological and psycho-social substrates that leads
to imperfect predictions of future status.

2. Desirable human characteristics tend to be positively correlated with
each other. This principle is particularly evident in unselected samples from
the entire population. For example, in an American or Western European population
the correlation between height and intelligence is approximately .25. There is
evidence (Husen, 1959) that this relationship is not genetically determined but
that it may be determined prenatally. As another example, the ability to make
simple perceptual discriminations is positively correlated with general verbal
knowledge. Some of these positive correlations may be determined genetically,
some by the psycho-social environment, and some by biological "accidents."
Whatever the explanation may be, however, the principle is important for a theory of
intelligence.

Summary. This chapter introduced a behavioral definition of intelligence
that goes beyond the simple statement that intelligence is what intelligence tests
measure. The behavioral repertoire that is called intelligence, and that is
sampled under controlled conditions by intelligence tests, develops out of
biological, including genetic, and psycho-social substrates, but without
independent assessment of these substrates it is not possible to make inferences about
them from a test score.

From this definition it follows that there are no qualitative differences
among intelligence, aptitude, and achievement, but there are quantitative
differences along three separate dimensions. These are the breadth of the repertoire,
its age, and its tie or lack thereof to a specific educational experience. From
these defined properties of the concept of intelligence and from some very
general principles of learning, genetics, and development, testable hypotheses can
be derived. These are presented in the next chapter.


HYPOTHESES DEVELOPED FROM THE THEORY



Hypotheses derived from the theory are presented in this chapter. The
deductions are not tight because the theoretical statements are not quantitative.
Quantitative statements can come only from more and better data, and the rigor of
the deductions is not important if there is a consensus that the conclusions do
indeed follow from the theory. At any rate, the check of theorem against data is
the conclusive step in the enterprise.

It is obvious that many of the hypotheses are circular; i.e., the theory was
derived from the data concerning intelligence tests and the nature of intelligence
tests, and the "tests" of the theory had known outcomes at the time the hypotheses
were derived. Certain ones represent predictions from the theory for which data
are not available, however, and consequently represent better checks on its
adequacy.

Three important classes of hypotheses will be discussed. One class includes
effects on mean performance of groups. A second class includes effects on
stability of individual differences. The third class includes predictions or
concurrent inferences made from intelligence tests. Both of the latter classes
involve effects on correlations, but in the second class the emphasis is on the
stability of intellectual performances while in the third class the emphasis is on
generality.

Mean Performance of Groups. A few of the hypotheses that follow from the
theory are almost trivial, but are worth stating as an antidote to common
psychological and lay thinking concerning the fixed nature of intelligence. It must
also be remembered that changes in means will be represented by arbitrary scales
of measurement with a mean and standard deviation based upon the performance of
some reference group. Change is expressed in age or grade units, or in standard
scores within age or grade groups.

When data are used to support hypotheses in this area, it must also be
recognized that experimental controls are frequently lacking. Statistical control
involving some variant of the partial correlational technique such as covariance
analysis is never a complete substitute for the control obtained through random
assignment of subjects to experimental groups. For one thing, measurement error
reduces the accuracy of statistical control. Failure to measure an important
component of variance is a second source of inadequate control. It is also possible
to control too much variance statistically and, as it were, throw the baby out with
the bath. Partialling reading comprehension measures out of relationships
involving intelligence tests would be considered suspect by most investigators.
There would be more debate concerning the partialling out of a measure of
socio-economic status from those same relationships. The presence of debate and the
lack of objective answers on such issues indicate all too clearly the hazards
involved. The lack of experimental control does not mean that research work
should cease on important problems. It does mean that a careful investigator will
be modest with respect to the conclusions he draws from his data.
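
The remark that measurement error reduces the accuracy of statistical control
can be illustrated with the standard first-order partial correlation formula. The
sketch below is not part of the original report. All correlations and
reliabilities in it are hypothetical, and it assumes that X and Y are related
only through the control variable Z, so that partialling out a perfectly reliable
Z would leave nothing.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y with Z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical true-score correlations (assumption, not data from the report).
R_XZ_TRUE, R_YZ_TRUE = 0.6, 0.5
R_XY = R_XZ_TRUE * R_YZ_TRUE  # X and Y correlate only because both depend on Z

def residual_after_control(reliability_z):
    """Partial r that remains when Z is measured with the given reliability."""
    # Classical attenuation: correlations with Z shrink by sqrt(reliability).
    shrink = math.sqrt(reliability_z)
    return partial_r(R_XY, R_XZ_TRUE * shrink, R_YZ_TRUE * shrink)

for rel in (1.0, 0.9, 0.7, 0.5):
    print(f"reliability of Z = {rel:.1f}   residual partial r = "
          f"{residual_after_control(rel):.3f}")
```

Under these assumptions the residual correlation is zero only when Z is perfectly
reliable and grows as the reliability of the control variable falls, which is the
sense in which statistical control is incomplete.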

1. Change will occur. The evidence here was referred to earlier. Isolated
and deprived groups show progressive declines in intelligence. The population of
the United States, as evaluated by military tests, has shown a progressive
increase in intelligence. Scottish children between 1933 and 1947 showed an
increase in intelligence, as measured by the group test administered on both
occasions (Scottish Council for Research in Education, 1949). After equating the
1916 and 1937 Stanford-Binet individual tests by an inadequate methodology,
however, the original investigators concluded that "real" intelligence, i.e., that
measured by an individual test, had not risen. The present writer, using published
data and a more adequate methodology (Humphreys, 1970), has shown that the
individual test results were almost completely parallel to the group test results.
The Scottish gain is not as large as the American gain, but the Scottish retest
occurred immediately after the close of World War II. The children tested had not,
by any means, had a normal Scottish educational experience.

2. Measurable change will occur only with the expenditure of substantial
effort. The literature concerning the effects of various educational methods is
pertinent here. Training experiments lasting up to one semester and involving an
hour or less per day have little differential effect on performance. The effects
of brief cramming or review sessions prior to taking an intelligence (or college
entrance) test are consistently very small. Nationwide testing program sponsors
advise students that cramming will do little good. Yet when a young man attends
a preparatory academy full-time for a year, the increase in scores on tests of
the College Board averages approximately 100 points on the three-digit scale
(Marron, 1965). Data for the separate tests are presented in Table 1. Marron
also found that some preparatory schools produced greater gains than others, but
no attempt was made to explain these differences.

Census figures show that the educational level of our population has risen
in each decade. These figures reflect a very substantial additional educational
effort over the years between the two World Wars and may well be a primary causal
factor in the measured increase in intelligence over that same period.
Furthermore, there has been some decrease in the growth rate of years of formal
education since World War II, and there has been a corresponding decrease in the growth
rate of intelligence.

3. For a given level of effort there will be greater effects on young
children than on older children. Growth curves of intelligence as a function of age
certainly do show decreasing returns with increase in age, but this finding is not
quite to the point with respect to the application of special efforts to
facilitate growth in intelligence with different age groups. There is also a problem
with regard to the units of measurement used since there is a general consensus
that either mental or educational age units decrease in size with increase in
chronological age. If change were measured in these units, empirical findings
would almost certainly be the reverse of those expected on the basis of the
hypothesis. Change must be measured, therefore, in relative units such as standard
score or classical intelligence quotient units. If the problems involved in the
measurement of both effort and intellectual growth are solved, however, it should
be easier to obtain change when the repertoire is small than when it is large.

4. Changes in intelligence are a function of the kind of intervening
educational experiences. Exposure to the traditional academic curriculum, with
attention to the problem of the learner's motivation, should be effective in producing
change in intelligence. Techniques of instruction conducive to the formation of
learning sets and generalization tendencies should also be effective in producing
change.

A good many years ago, Lorge (1945) published data on the relationship
between retest gains on an intelligence test and intervening educational
experience. There are also good recent data published by a Swedish investigator for
changes between ages 13 and 18 (Harnqvist, 1968). Since different tests were used at
the two age periods, Harnqvist obtained canonical composites. The major
comparisons involve the first canonical composite, which has reliabilities of .94 and
.932 for the initial and final measures, respectively. The metric differs on the
two occasions, however, with the initial standard deviation being 10.10 and the
final 8.37. There may be a small ceiling effect on the final canonical composite.

Table 2 summarizes two of Harnqvist's estimates of gain in the major
educational groupings suggested by him. Also included are data for regressions of scores
based upon the estimated within-groups reliability of the canonical composite
computed by the present writer. Harnqvist concluded that the gains computed from
scores regressed in accordance with the estimated "true" stabilities of the tests
were probably most valid. Gains based upon difference scores corrected for
differences in metric represent the most conservative estimate of gain. Gains computed
from estimated reliabilities are intermediate.

The present writer has little confidence that he has the last word on the
most appropriate method of estimating gain from these data. It does not seem
reasonable to use stability coefficients, either obtained or corrected for errors
of measurement, since the experimental conditions affecting the means also
presumably affect the stabilities. On the other hand, something other than correcting
for a change in metric is in order. The intermediate values based upon reliability
are more conservative than those based upon stability of measures over time. By
any method, however, gains are differentially associated with the amount and type
of intervening education. While gains for the higher groups may be somewhat
attenuated by the ceiling effect, it is also evident that the differences among
groups are not spectacularly large, and there is much overlap. Many other factors
beyond formal schooling are obviously involved. Since subjects were not assigned
to groups at random, caution is also indicated concerning attributions of cause.
Any laboratory analogue, on the other hand, must be considerably less realistic
than the present "experiment," which lasted for 5 years.
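
The three ways of estimating gain compared above can be sketched as one formula
with a different weight on the initial score. This is one plausible reading of
the procedure, not a reproduction of Harnqvist's computations. The group means
and the stability value below are hypothetical. Only the reliability of about
.94 comes from the text.

```python
def standardized_gain(z_initial, z_final, weight):
    """Group gain in final-test SD units.

    z_initial and z_final are group means expressed as deviations from the
    total mean in SD units of their own occasion, which removes the difference
    in metric. weight is the regression weight applied to the initial standing.
    """
    return z_final - weight * z_initial

# Hypothetical group means (NOT Harnqvist's data): (initial z, final z).
GROUPS = {"most academic": (0.80, 0.95), "least academic": (-0.60, -0.70)}

# 1.00 gives a plain difference score, 0.94 is the reliability quoted in the
# text, 0.80 is a hypothetical stability coefficient.
WEIGHTS = {"difference": 1.00, "reliability": 0.94, "stability": 0.80}

def gap_between_groups(weight):
    """Difference in estimated gain between the two hypothetical groups."""
    hi = standardized_gain(*GROUPS["most academic"], weight)
    lo = standardized_gain(*GROUPS["least academic"], weight)
    return hi - lo

for label, w in WEIGHTS.items():
    print(f"{label:12s} weight = {w:.2f}   gap in gain = {gap_between_groups(w):.3f}")
```

With these illustrative numbers the difference-score method gives the smallest
separation between groups and the stability-based method the largest, with the
reliability-based method in between, which matches the ordering described in the
text.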

5. For intellectual growth there must be a continuous supportive
psycho-social substrate. There is no magic key and no critical time for intellectual
stimulation. Temporarily successful Head Start type programs will not be
successful in the long run if the children immediately revert to the prior psycho-social
environment. There must be continuing exposure and continuing effort. The
exposure can be readily manipulated by the society, but the effort required is that
of the learner. Social effort that does not affect individual effort will not
pay off. Current evidence concerning these issues is almost entirely lacking, but
the obtaining of such evidence is one of the most critical research issues of our
time. It is also a difficult research area.

6. Change is slower for intelligence than for more narrowly defined abilities.
Intervention in a narrow area will produce more rapid and larger amounts of change
than intervention in a broad area. Differential gains on the so-called aptitude
and achievement tests of the College Board resulting from preparatory school
experience are relevant. Table 1, presented earlier, showed that there is greater
gain in English and in Mathematics achievement than in Verbal or Mathematical
aptitude.

7. There will be little decline in intellectual performance in the absence
of clearly discernible biological deterioration. Since there is little forgetting
of overlearned and continuously practiced skills, the repertoire should not shrink.
Older data seemingly contradict this statement. The more recent and better
controlled research, however, indicates that the well documented decline is the
result of failure to control intergenerational differences in intelligence. Older
research was entirely cross-sectional. Cohorts of different ages were measured
at the same point in time. The more recent data (Schaie, 1965) involved measuring
different cohorts at the same point in time, but an additional test administration
was required of the same cohorts five years later. The analysis of variance
allows one to estimate contributions to variance of age cohort and of aging, with
the result that the former is found to make the main contribution. A reanalysis
of the same data by Wackwitz (1970) shows this phenomenon even more clearly.
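
How a purely cross-sectional design can manufacture an apparent age decline may
be seen in a toy model. The sketch below is an illustration only: the score
function, the testing year, and the size of the cohort gain are hypothetical, and
the aging effect is set to zero by assumption.

```python
def score(birth_year, age, cohort_gain_per_year=0.3, aging_effect_per_year=0.0):
    """Hypothetical adult test score depending on cohort (birth year) and age."""
    cohort_part = cohort_gain_per_year * (birth_year - 1900)
    aging_part = aging_effect_per_year * (age - 20)
    return 100 + cohort_part + aging_part

def cross_sectional(test_year, ages):
    """Scores of different age groups, all tested in the same year."""
    return {age: score(test_year - age, age) for age in ages}

def longitudinal_change(test_year, age, years_later=5):
    """Change for one cohort retested after the given number of years."""
    birth_year = test_year - age
    return score(birth_year, age + years_later) - score(birth_year, age)

AGES = [20, 30, 40, 50, 60]
print("cross-sectional:", cross_sectional(1956, AGES))
print("five-year retest change:", [longitudinal_change(1956, a) for a in AGES])
```

In this model older groups score lower in the cross-section although no cohort
changes at all on retest, so the whole apparent decline is a cohort difference.
A design with a follow-up of the same cohorts is what allows the two sources to
be separated.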
8. The educational practices of a society will have an effect on the age at
which intelligence levels off. For example, pushing up the age limit for
compulsory education should increase intelligence. It is of interest, in this connection,
that the 1960 revision of the Stanford-Binet accepts the reality of mental growth
for people in general to a higher age level than earlier editions of the test.
The change in occupational patterns from a concentration of persons in manual labor
to an increased proportion in more intellectual occupations should have a positive
effect also.

9. There are mean differences in intelligence among groups defined
demographically. This proposition is one of the best supported in the psychological
literature. With the exception of sex differences on certain tests which were
constructed to minimize such differences, all sorts of demographic variables show
differences on intelligence tests without resort to Ns of astronomical size. Race,
section of country, rural-urban location, education of parents, education of
examinee, school attended, level of teachers' salaries, etc., all show
differences. Interpretation of these differences is another matter, however.
With adequate experimental controls an analysis of variance design could lead to
estimates of percentage contributions to variance of psycho-social and biological
substrates for the particular fixed levels of the independent variables studied.
Results from a fixed variable design would hardly qualify as earth shaking in
their implications for the heredity-environment issue, but in the absence of
experimental controls conclusions with respect to percentage contributions are
better characterized as meaningless rather than as limited in generality.

10. Among adult representatives of groups demographically defined it will be
difficult to overcome existing differences. This proposition is independent of
the attribution of degree of importance to psycho-social and biological substrates,
or within the latter to genetic versus acquired biological differences.
Psycho-social deficits are not quickly and easily compensated for. Change takes place
slowly.

11. There will be genetic differences among members of demographically
defined groups whenever the definition of group accompanies some degree of
segregation of gene pools. These differences will vary in size and sign of the
difference from one of the very large number of biological characteristics to another.
To take a concrete example, Negroes will be superior to Caucasians on some
characteristics, inferior on others. The summation of the effects on developing
intelligence from the entire gamut of biological characteristics will also show a
race difference, simply because it is inconceivable that the algebraic summation
of the effects of a very large number of partially segregated independent causal
factors would be zero. On the basis of present data it is not possible to specify
either the size or sign of this overall difference, though it is certainly smaller
than present observed differences in performance.

12. The selective breeding experiments with lower animals, such as those by
Tryon (1929, 1940), could, with adequate controls, be replicated for high and low
intelligence groups in the human. While this experiment will probably never be
done, and with good reason, it is still useful to suggest the hypothesis. A
summary of the controls necessary to reproduce the results with lower animals
serves to make explicit the fallacies in the thinking of those persons who place
great weight on social class or caste in human society.

The experiment starts with upper and lower groups of subjects selected from
the tails of the distribution of intelligence. High subjects are mated only with
high and low with low; all average subjects from the first generation are
discarded. Subjects in the next generation are again measured and offspring who do not
meet the standards of their parents are ruthlessly discarded. After about a dozen
generations of highly selective mating and discarding of unwanted offspring, rats
show two distributions of maze running ability with very little overlap. The
genetic substrate for intelligence in the human is probably more complex than the
genetic substrate for maze running in the rat (for one thing, there are more
chromosomes in the human), so that many more than a dozen generations would be
required to separate bright and dull groups an equal amount in the human.

Since the necessary conditions for the experiment are so greatly at variance
with human breeding patterns, even in relatively highly stratified societies,
there is no justification from this hypothesis for an assumption of large, fixed
differences in genetic substrates among existing social classes and for the use of
this reasoning as a basis for a highly stratified class society. For example,
there is a common saying among conservatives that any revolution that abolished
existing social classes would soon result in their reestablishment. While this
seems to be true historically, and while it is reasonable psychologically as well,
it overlooks two important factors: the new superior class would be composed of
different people than the old, with many coming from the lowest social class;
and the offspring of the new class would be inferior to their parents, just as the
present class that currently is in a power position in a highly stratified
society is inferior to their parents who were in turn inferior to theirs. That is,
without both selectivity in mating and the ruthless discard of inadequate
offspring, an initial genetic difference between persons of high and low achievement
will diminish progressively in their descendants.

Stability of Individual Differences. Hypotheses in this area are mainly
concerned with changes in the rank order of individuals over time. Time is not, of
course, the effective variable, but in the absence of control of type of experience
or growth, time is the appropriate dependent variable. Research that will pin down
the factors that occur in time that produce the instability should have very high
priority.

It will also be noted that many hypotheses are parallel to those in the mean
performance of groups section. It seems reasonable that changes in rank order of
individual differences will accompany changes in the mean status of groups.

1. Stability coefficients will always be smaller than reliability
coefficients. Change is inevitable. While this generalization will be modified in
subsequent theorems by such variables as age of the subjects, amount of time
involved, and intervening experience, raw change is the primary phenomenon. It is
a phenomenon, furthermore, with which psychologists concerned with prediction have
not dealt in any systematic, comprehensive way.

2. Stability over time is a function of the age of the subject. With
increasing age there is greater stability. This follows from the increasing size of
the intellectual repertoire with increasing age and the relative size of increments
to that repertoire as a function of age. John Anderson phrased the principle in
terms of the characteristics of the part-whole correlation, assuming that
increments were uncorrelated with the base at the beginning of the period. While his
data were congruent with the latter assumption, it is not necessary to make that
assumption in "deriving" the hypothesis. A correlation between increment and base
that is lower than unity after correction for attenuation is a sufficient
condition. Some degree of unpredictability of future learning or development is
required, but not complete unpredictability.
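
The part-whole argument can be written out under a deliberately simplified
assumption that the text does not state in this form: the repertoire at age t is
the sum of t yearly increments that are mutually uncorrelated and of equal
variance. The symbols X_t, d_i, and sigma^2 are introduced here only for
illustration.

```latex
X_t = \sum_{i=1}^{t} d_i, \qquad
\operatorname{Var}(d_i) = \sigma^2, \qquad
\operatorname{Cov}(d_i, d_j) = 0 \quad (i \neq j)

\operatorname{Cov}(X_s, X_t) = s\,\sigma^2 \quad (s \le t), \qquad
r(X_s, X_t) = \frac{s\,\sigma^2}{\sqrt{s\,\sigma^2}\,\sqrt{t\,\sigma^2}}
            = \sqrt{\frac{s}{t}}
```

For a fixed interval k, the value sqrt(s/(s+k)) rises toward unity as the
starting age s increases, which is the pattern of greater stability at later
ages. Increments that are partly correlated with the base would be expected to
raise these values, and the claim in the text is that the same qualitative
pattern remains as long as that correlation is below unity.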

It is well known that correlations between infant and early grade school tests
of intelligence are approximately zero. This has traditionally been explained as
due to a difference in functions measured by tests at the two time periods. This
explanation is unnecessary since change will take place rapidly starting with the
very small infant repertoire. The data are almost precisely what would be
predicted if the tests were measuring the same function. The only discrepancy between
prediction and actual outcome, if it is real in the sampling sense, is between an
expected small positive correlation and the obtained (Bayley, 1949) small
negative correlation.

The degree of instability of intelligence, and the increasing degree of
stability with age, are well shown in the intercorrelations of mental ages obtained
in the Harvard Growth Study. Data for boys are shown in Table 3 and data for
girls in Table 4. One can also see in these tables some evidence for a period
of increasing instability around the period of sexual maturity. This secondary
instability appears earlier in the data for girls than for boys.

3. Instability over time is as characteristic of physical traits as of
intelligence. While there is a psycho-social substrate for height and weight, it is
reasonable to believe that the genetic substrate for height and possibly weight is
relatively more important than for intelligence. Change in these characteristics
is shown in Tables 5 and 6. Height is clearly more stable than weight and both
are more stable than intelligence, but all show the same pattern of
intercorrelations.

4. The amount of instability is a function of the amount of time between test
and retest, holding age constant. The continuous addition of uncorrelated or lowly
correlated increments to an initial base results in more and more change in the
rank order of individuals with the passage of time. Data previously presented in
Tables 3 to 6 confirm this hypothesis.
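
The mechanism described in this hypothesis, the repeated addition of
uncorrelated increments to a base, can be simulated directly. The sketch below
uses artificial data only: the number of simulated persons, the number of
occasions, and the unit-variance normal increments are all assumptions made for
illustration and have no connection with Tables 3 to 6.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate(n_people=5000, n_occasions=6, seed=0):
    """Return one list of scores per occasion; each score is a running sum."""
    rng = random.Random(seed)
    scores = [[] for _ in range(n_occasions)]
    for _ in range(n_people):
        total = 0.0
        for t in range(n_occasions):
            total += rng.gauss(0.0, 1.0)  # new increment, uncorrelated with base
            scores[t].append(total)
    return scores

scores = simulate()
for s in range(len(scores)):
    row = [pearson(scores[s], scores[t]) for t in range(len(scores))]
    print("  ".join(f"{r:.2f}" for r in row))
```

The printed matrix should show correlations falling off with distance from the
diagonal and rising along it toward later occasions, the same general pattern
the text attributes to the intercorrelation tables.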

5. Change is more rapid with narrow than with broad functions. Other things
being equal, change should be more rapid in verbal or quantitative ability alone
than in intelligence. A prime example of this hypothesis is the learning of a
motor skill. The intercorrelations of trials, or blocks of trials, all obtained
during a single learning session, show the same pattern of instability found for
intelligence and height over a period of several years. Changes in the rank order
of individuals obtained in the course of half a day for a very narrow, rapidly
acquired disposition are comparable to those obtained over a period of several years
for a much broader, more slowly acquired disposition.

Stability coefficients for the College Board tests for the period from
September to March were presented in Table 1 along with the gains made by students in a
preparatory school. The aptitude tests show greater stability than the achievement
tests. The former sample broader and older repertoires than the latter.

6. Change is a function of the intervening psycho-social substrate. With
respect to intelligence there should be more change in individual differences for
students in an academic curriculum than in a skilled trade curriculum. There
should be more change among a group of professional men than among a group of
skilled workers. Change should also be dependent upon avocational interests. In
general, the greater the opportunity to add to the intellectual repertoire, the
greater should be the shift of individual differences as a function of the amount
of time the exposure continues.

Harnqvist (1968) presented regression coefficients for the several groups
of subjects studied, but with standard deviations made available (1969) these can
be converted to correlations. The within-group correlations for the four major
categories of type of education are presented in Table 7. Within-group standard
deviations are also shown. The results are in the expected direction, but they
are also equivocal. The two groups whose experience has presumptively been less
academic have larger standard deviations, which might alone produce the higher
correlations obtained. There is, however, no applicable correction for restriction
of range of talent. An interpretation involving restriction of range of talent,
on the other hand, depends upon equal units of measurement in the several parts of
the scale, which the possibility of a ceiling effect makes suspect. Again, as with
the mean gains, it can be said that the differences are not dramatic and that better
control of intervening experience than that afforded by type of schooling will be
necessary to test the hypothesis more precisely.
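
The conversion from regression coefficients to correlations mentioned above
rests on the standard identity r = b * (SD of predictor) / (SD of criterion). A
minimal sketch follows. The slope is hypothetical, and the standard deviations
are the total-group values quoted earlier in the text (10.10 and 8.37), used here
only as placeholders for the within-group values of Table 7.

```python
def regression_to_correlation(slope_y_on_x, sd_x, sd_y):
    """Convert the raw-score slope of Y on X into a Pearson correlation."""
    return slope_y_on_x * sd_x / sd_y

# Hypothetical slope of final score on initial score (assumption).
HYPOTHETICAL_SLOPE = 0.70
r = regression_to_correlation(HYPOTHETICAL_SLOPE, sd_x=10.10, sd_y=8.37)
print(f"implied correlation: {r:.3f}")
```

The identity also shows why a group with a larger initial standard deviation can
yield a higher correlation from the same slope, which is the ambiguity the text
raises about the less academic groups.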

7. Degree of incentive to learn or strength of motivation present in a group
will be positively associated with amount of change. Students in a highly
competitive, academically oriented educational institution will show more change in rank
order of individual differences than will students in a more placid environment.
It is possible that persons in a free, fluid society will show more change than
persons in a highly structured society in which position is dependent on class or
caste.

There appears to be no available evidence concerning this proposition. On
an anecdotal basis, there may be more early stars that crash, and students that
bloom late, at colleges such as Reed and Oberlin than in state colleges. An
investigator would, of course, have to control range of talent for any work in
this area.

There are data on amount of change in rank order of grade averages in a large
state university over the four year time span (Humphreys, 1968), but there are
presently no comparative data from other types of institutions. The sheer amount
of change is sufficiently impressive, however, to give inferential support to the
theorem. Intercorrelations of independently computed semester averages are shown
in Table 8. It can be inferred that the changes in grades parallel, to some degree
at least, changes in measured academic abilities.

8. The stability of individual differences in intelligence from person to
person among a set of related persons is nonzero, but less than the reliabilities
of the measures. This proposition, furthermore, follows from the influences of
both the psycho-social and biological substrates. All substrates are involved in
determining individual differences in intelligence and, except for monozygotic
twins for whom there are no genetic substrate differences, all substrates differ
among sets of related persons. The much discussed regression from parent to child,
or from child to parent, for example, depends upon a finding of less than perfect
correlations between parents and children and nothing more. Attribution of cause
to the genetic substrate without independent assessment is barred here just as it
is in interpreting the I.Q. of an individual. Parents and children have different
childhood environments, the children themselves have different functional
environments within the family, and different genotypes may interact with similar
environments in a very dissimilar fashion.

The genetic interpretation of family resemblances does have one advantage over
an environmental interpretation in that the degree of resemblance expected can be
set with at least a modest degree of precision. The degree of precision must be
called modest, however, because for psychological characteristics there is some
degree of assortative mating, and the heritability coefficient is less than unity.
Information is lacking as to the number of generations of assortative mating there
has been and whether the same degree of resemblance between parents held in times
past as in the present. Estimates of heritability of intelligence also vary from
about .83 at the top to substantially lower values. By fixing either the genetic
correlation, arising from assortative mating, or the degree of heritability,
hypotheses involving a range of correlation coefficients can be tested.



9. Regression from initial standing to final standing in intelligence is
toward the mean of the identifiable subgroup of which the individual is a member.
For the entire range of talent, without differential intervention in terms of
intervening experience, the stability coefficient for intelligence will be less
than the reliability coefficient. As a result subgroups defined by score on the
initial test will regress more toward the population mean than would be expected
on the basis of measurement error alone. Different forms of intervention may
accelerate, retard, or reverse the expected regression toward the population mean
and will involve instead the subgroup mean.
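
The amount of regression implied by this hypothesis follows from the usual
linear prediction in standard-score form. The sketch below is illustrative
only: the initial standing, the reliability, the stability, and the subgroup
mean are all hypothetical values.

```python
def expected_final(z_initial, r, target_mean=0.0):
    """Expected final standing (SD units) when scores regress toward target_mean.

    r is the correlation between initial and final standing; target_mean is
    the mean toward which regression occurs (0.0 = population mean).
    """
    return target_mean + r * (z_initial - target_mean)

Z0 = 1.5            # hypothetical initial standing, SD units above the mean
RELIABILITY = 0.94  # hypothetical; regression due to measurement error alone
STABILITY = 0.75    # hypothetical; lower than reliability, as hypothesis 1 states

print("error alone:          ", expected_final(Z0, RELIABILITY))
print("error plus change:    ", expected_final(Z0, STABILITY))
print("toward subgroup mean: ", expected_final(Z0, STABILITY, target_mean=1.0))
```

Because the stability coefficient is below the reliability, the expected final
standing lies closer to the population mean than measurement error alone would
produce. When the relevant mean is that of a superior subgroup, the same
coefficient leaves the individual nearer his initial standing.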

An example of the importance of this theorem is available in the folklore of
higher education. It has been said that the graduates of superior colleges are
no more superior than they were as entering freshmen. This allegation (firm
data are lacking) is typically used to belittle the quality education claims of
such colleges and places the emphasis on initial selection of the student body.
On the basis of the present hypothesis, however, an institution that prevents the
expected regression is doing a superior educational job.

It would not be difficult to obtain data concerning this issue. The College
Board aptitude tests and the Graduate Record Examination aptitude tests are
sufficiently similar that one could be quite confident concerning equipercentile
conversions based on a random sample of applicants for college admissions.
Comparison of pre and post test results for a variety of types of institutions would
then be possible. There is one difficult matter that interferes with a complete
assessment: the expected regression in the population in the absence of
differential intervention is unknown.
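
An equipercentile conversion of the kind mentioned above can be sketched as follows: a score on one test is converted to the score on the other test that has the same percentile rank in a common reference sample. This is a minimal sketch with hypothetical function names and made-up scores; operational equating would add smoothing and a much larger sample.

```python
from math import floor


def percentile_rank(sample, score):
    """Proportion of the sample below the score, counting ties as half."""
    below = sum(1 for v in sample if v < score)
    equal = sum(1 for v in sample if v == score)
    return (below + 0.5 * equal) / len(sample)


def score_at_percentile(sample, p):
    """Score at percentile rank p, by linear interpolation between order statistics."""
    ordered = sorted(sample)
    n = len(ordered)
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)
    lo = int(floor(pos))
    hi = min(lo + 1, n - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def equipercentile(score, sample_x, sample_y):
    """Convert a score on test X to the test Y score with the same percentile rank."""
    return score_at_percentile(sample_y, percentile_rank(sample_x, score))


# Hypothetical reference sample that took both tests.
entrance_scores = list(range(200, 800, 10))
graduate_scores = [v + 50 for v in entrance_scores]

print(round(equipercentile(500, entrance_scores, graduate_scores), 1))
```

Once both tests are expressed on a common scale in this way, pre and post means for a given institution can be compared directly.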

The present hypothesis is intimately related to current social problems such
as integrated education and admission of marginally qualified students to college,
but the proposition is not sufficiently precise at the moment to make the needed
predictions. Certain extreme cases seem clear. A marginal student who quickly
fails will not profit. A marginal student who is only slightly marginal and who
survives should profit. Presumably each student should be pushed hard
intellectually, but it is also possible to push too hard. But what is the result if the
student is kept in a generally superior learning environment by means of special
sections or differential standards of evaluation?

A partial answer to some of these questions is furnished by a reanalysis of
the data in the Coleman report (1966). Using partial correlation techniques to
control for variables such as socio-economic status, McPartland (1969) has shown
that integrated classrooms seemingly increase the academic performance of Negroes
while integrated schools having segregated classrooms do not. Similar studies
need to be done on the academic performance of Caucasian children in integrated
schools and integrated classrooms.

The efficacy of the several components of a superior learning environment is
also unknown. In addition to faculty and facilities such as libraries and
laboratories it is probable that the peer group itself is very important. If peers are
important, the important ones would be the functional peers, or the significant
peers, not merely those who happen to attend the same institution. In large
universities particularly there are large numbers of functional peer groups having
very diverse characteristics. Measurement of the characteristics of the functional
peers, which Astin (1965) has done on an institutional scale, should provide very
useful information.

Validities of Intelligence Tests. Test validities are usually described by
correlation coefficients just as are the stabilities of individual differences.
When the time interval between test and criterion is the critical variable, parallel
hypotheses result. In these cases hypotheses in this section are presented with a
minimum of discussion. More attention will be given hypotheses concerned with the
generalizability over content of inferences drawn from scores on intelligence tests.

1. The extent to which predictive validities of intelligence tests will
decrease with the passage of time is a function of the age of the subjects. With
increasing age there is less shrinkage of the validity coefficients.

2. The extent to which validities of intelligence tests will decrease over
time is a function of the amount of time that intervenes between test
administration and the accumulation of criterion information. Prediction of college grades,
semester by semester, would seem to be an appropriate setting to test this
hypothesis. The data obtained, which show that the problem is more complex
experimentally than it appears superficially to be, are presented in Table 9. The
predictive validities (Humphreys, 1968) fall off very nicely in accordance with
the hypothesis. The postdictive validities (Humphreys, 1970) show that there has
been a change in the rank order of students' academic abilities as a function
of the educational experience, but the correlations for junior and senior grades
are not as high as they should be if only changes in abilities were involved.
While the hypothesis is supported, the amount of change was overestimated from
the predictive validities alone.

3. Validity coefficients change more for narrow than for broad functions.
Wechsler-Bellevue intelligence quotients should show a less steep gradient of
validities than a college admissions test, since the Wechsler test represents a
broader gamut of abilities than does the typical college admissions test.
Empirical support for this hypothesis can be obtained from the postdictive study
discussed under 2 above. Table 10 contains a comparison of correlations between
the "aptitude" and advanced test sections of the Graduate Record Examination and
semester grades. There is clearly more change in the correlations for the
narrower tests (Humphreys, ibid.).

4. The gradient of validities over time is a function of the content of the
intervening psycho-social substrate. There should be a steeper gradient for
intelligence tests in a highly academic curriculum than in a skilled trade curriculum.

5. The gradient of validities over time is a function of the degree of
motivation present in the group. Size of validities in a highly competitive academic
institution will shrink more than those in a more placid environment.

6. Gradients of predictive validities of intelligence tests are accompanied
by similar gradients of postdictive validities. The gradients are not necessarily
identical in shape, but age, time, and intervening experience will have similar
effects both forward and backward in time.

7. Intelligence tests have a broad spectrum of concurrent and predictive
validity coefficients. The intelligence test is broad, covering verbal, numerical,
figural, and pictorial items requiring a wide range of types of responses such as
association, comprehension, induction, deduction, memorization, etc. on the part
of the examinee. Furthermore, desirable qualities are positively correlated.
As a result it is difficult to find a criterion measure in the full range of
talent for which an intelligence test does not have a positive nonzero validity.

8. It follows from 7 above that differential validity of narrow aptitude
tests is difficult to establish in the full range of talent. The restriction of
range associated with passage through the educational hierarchy affects the general
factor primarily so that differential validity patterns are more readily observed
in restricted populations such as college undergraduates. Validation studies in
the military enlisted population support strongly this proposition.

9. Even though the validity spectrum is broad the very highest validities
are obtained in educational settings. Test content is more like the academic
curriculum content than that of other common learning experiences. Within the
educational setting the highest validities are obtained with criteria that have
the most overlap in content with the test. Predictions of later reading
comprehension and proficiency in arithmetic are higher than predictions of spelling
accuracy. Music, art, and athletics are even less intellectual, in the present
sense of that term, than spelling. Correlations with grades in foreign language
courses stressing the spoken language are also low and, by the same token, the
performance in foreign language training is not very intellectual.

10. There are many important criteria that are not predicted highly by
scores on intelligence tests. An analysis based upon transfer principles is a
reliable guide to the expected size of these correlations when test and criterion
reliabilities are held constant. For example, leadership, sales, and manipulative
criteria are not predicted well by intelligence tests.

Psychologists have been able to rationalize low correlations with the latter
two criteria, but the first has been difficult to accept. Acceptance is made
difficult by common beliefs concerning the nature and importance of intelligence
and the importance of leadership behavior in our society. Correlations are
frequently computed in a very restricted range of talent when leadership is involved,
but this is only a partial explanation of their small size. When the same sample
of officers is sent back to school for either officer or technical training,
correlations with school grades become substantially higher than those previously
obtained with rated officer effectiveness. In such comparisons, of course, the
range of talent in intelligence is constant.
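
The effect of a restricted range of talent on a validity coefficient can be illustrated with the standard univariate correction for direct selection on the predictor (commonly labeled Thorndike's Case 2). This is offered as a sketch only: the report does not state which correction was applied in its own tables, and the correlation and standard deviations below are invented for illustration.

```python
from math import sqrt


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the correlation in the unrestricted group.

    Assumes direct selection on the predictor, linear regression, and
    equal error variance in both groups (standard Case 2 assumptions).
    """
    k = sd_unrestricted / sd_restricted
    r = r_restricted
    return (r * k) / sqrt(1.0 - r * r + r * r * k * k)


# Hypothetical: validity of .30 observed in a selected group whose
# predictor standard deviation is two thirds that of the full group.
observed = 0.30
estimated = correct_for_range_restriction(observed, sd_unrestricted=15.0, sd_restricted=10.0)
print(f"observed {observed:.2f}, estimated in full range {estimated:.2f}")
```

The correction raises the coefficient only moderately for this degree of restriction, which is consistent with the remark that restriction is only a partial explanation of small correlations with leadership criteria.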

11. Theory is not now and will not in the foreseeable future be an adequate
basis for the use of an intelligence test in a new situation or with a new
population of examinees. Accurate use of a test requires a regression equation or an
equivalent actuarial table. It is not sufficient to decide that a test will be
correlated with a particular criterion. Making predictions concerning individuals
or groups requires precise information concerning errors of estimate, slopes of
regression lines, and intercepts of regression lines. In spite of some 60 years
of use of intelligence tests, furthermore, the amount of information required to
use tests properly is still quite inadequate. The common definition of
intelligence as a fixed general capacity along with the ease of making inferences from
this interpretation is partially responsible for this state of affairs.
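
The three quantities named above (slope, intercept, and error of estimate) can be obtained from paired test and criterion scores by ordinary least squares. The sketch below uses invented scores and a hypothetical function name; it illustrates the computation and is not a reanalysis of any data in this report.

```python
from math import sqrt


def fit_regression(test_scores, criterion_scores):
    """Least-squares regression of criterion on test.

    Returns (slope, intercept, standard_error_of_estimate).
    """
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion_scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares; clamp tiny negative rounding error to zero.
    residual_ss = max(0.0, syy - slope * sxy)
    see = sqrt(residual_ss / (n - 2))
    return slope, intercept, see


# Hypothetical test scores and later grade averages for five examinees.
tests = [90, 100, 110, 120, 130]
grades = [2.0, 2.4, 2.5, 3.1, 3.3]

slope, intercept, see = fit_regression(tests, grades)
predicted = intercept + slope * 115
print(f"predicted grade for a score of 115: {predicted:.2f} (error of estimate {see:.2f})")
```

Two groups can share a correlation coefficient and still differ in slope or intercept, which is why the correlation alone is not sufficient for making predictions.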

A case in point is the controversy concerning the use of intelligence tests
for the "underprivileged." Typically, this boils down to a question concerning
the use of tests for American Negroes. A consistent finding, though one not as
broadly documented as it should be, is that for periods up to about one year the
same regression equation can be used for members of both Negro and Caucasian
groups for the prediction of a variety of criteria. Within this body of data
there are some small exceptions to this generalization, chiefly with regard to
the intercept of the regression of the test on the criterion, but the sum of
these small intercept differences does not favor the Negro.

The naive environmentalist who accepts the common definition of intelligence
as some entity inside the person may be dismayed by the above empirical findings,
but they are quite reasonable from the point of view of the present theory. The
intelligence test predicts later intellectual performance whether that performance
be another test or a socially desirable criterion. It does this just because both
occasions sample overlapping intellectual repertoires. The amount of overlap and
the rapidity of change are functions of the variables previously discussed in
this chapter.

12. A necessary empirical basis for concluding that low on-the-job
validities, as opposed to high training validities, demonstrate that the job and training
situations are functionally different involves both a predictive and a concurrent
validity for measures of the same disposition. A low long range predictive
validity and a high concurrent validity show that the people in the sample have
changed. A low concurrent validity for an intelligence test, when the intelligence
test was highly correlated with early training criteria, along with a significantly
higher correlation for a test of some other disposition, is a necessary condition
for concluding that training and job ability requirements are indeed different.

There have been many claims that on-the-job criteria have little relationship
to intelligence. As a matter of fact some writers have gone so far as to claim
that this is a nearly universal phenomenon. An implicit assumption basic to the
claims that have been made to date is that abilities are fixed. Once this
assumption is questioned, the controls that no one previously considered become
essential.

13. Early training success is not a criterion of the degree of importance
that it has assumed in test validation. The first 6 hypotheses concerned with
validity are sufficient grounds for this assertion. In the absence of ability to
predict changes, for many selection purposes retention or turnover has many
attractive characteristics for criterion purposes. In deciding between early
training success and retention as criteria, questions that must be faced, among
others, are the following: how much change takes place, how rapid is the change,
how large are training costs, how much capacity for training is available or can
be obtained, what are the characteristics of fast learners that slow learners
would replace, what are the differences if any between asymptotic performances of
slow and fast learners?

The preceding discussion does not presuppose that man is infinitely trainable
or that individual men are indefinitely trainable, but in the absence of
information concerning capacity, which is not furnished by any aptitude test, one can not
stake everything on initial training success. The only solution lies in more and
better research.

14. The lowering of standards of initial selection for a group will result
in lower final performance even though the time span between selection and
performance is sufficiently long to reduce validity coefficients to near zero. This
hypothesis is based on a previous one to the effect that change within a subgroup
given special treatment is about the mean of that subgroup rather than about the
population mean. Since the present hypothesis is a secondary one based in turn
upon an earlier hypothesis it must be stressed that it is highly speculative.

Although speculative, this hypothesis is needed as an antidote to a different
and probably overoptimistic inference from drastically reduced long term validity
coefficients: namely, that initial selection does not matter. For example, in
the well publicized World War II unselected group of pilot trainees (DuBois,
1947), if training standards had been reduced in line with the input, would the
mean performance in the air of the group after training have been appreciably
lower than the performance of control groups even though correlations with
on-the-job criteria were essentially zero? There are no data concerning this
question, but it should have high priority in an applied research program.

Table 1

Gains on College Board Scores as a Function of Preparatory School Attendance

                                    September        March                Stability
Score                         N    Mean   S.D.    Mean   S.D.    Gain    Coefficient

Verbal Aptitude              714    471    89      528    05       57       .81
Mathematics Aptitude         715    532    99      611    94       79       .83
English                      649    458    82      540    93       82       .69
Intermediate Mathematics     610    497    89      629   100      132       .76
Advanced Mathematics         251    494    05      620    96      126       .74

Table 2

Gain as a Function of Type of Intervening Schooling

                                                     Corrected Gains
                               Initial    Final     Retest    Reliability   Standardized
Type of Schooling        N      Mean      Mean    Regressed    Regressed        Diff.

Compulsory Level       1518    36.38     39.08       4.99        6.23           6.70
Vocational              946    39.46     40.23       6.72        7.50           7.80
Lower Secondary         958    44.49     43.40       8.58        8.13           7.96
Gymnasium              1194    49.90     47.00      10.39        8.55           7.86

Table 3

Intercorrelations of Mental Ages of Boys
at Various Chronological Ages (First Test)
(Decimal points omitted)

Age     8    9   10   11   12   13   14   15   16   17

 8         721  712  747  729  657  598  648  652  556
 9    721       751  721  714  696  634  615  609  583
10    712  751       816  769  704  726  738  699  604
11    747  721  816       839  787  745  810  802  736
12    729  714  769  839       854  778  786  806  775
13    657  696  704  787  854       864  785  770  780
14    598  634  726  743  778  864       839  778  750
15    646  615  738  810  786  785  839       868  778
16    652  609  699  802  806  770  778  868       848
17    556  588  604  736  775  780  750  778  848
18    444  509  543  638  732  754  765  744  788  828

Table 4

Intercorrelations of Mental Ages of Girls
at Various Chronological Ages (First Test)
(Decimal points omitted)

Age     8    9   10   11   12   13   14   15   16   17

 8         730  719  761  735  661  661  719  696  603
 9    730       746  744  774  757  705  698  723  704
10    719  746       812  820  794  788  784  756  709
11    761  744  812       884  832  804  841  837  787
12    735  774  820  884       861  841  846  857  844
13    661  757  794  832  861       871  823  830  837
14    661  705  788  804  841  871       865  812  817
15    719  698  784  841  846  823  865       903  839
16    696  723  756  837  857  830  812  903       912
17    603  704  709  787  844  837  817  839  912
18    549  607  710  730  817  821  830  837  857  900

Table 5

Intercorrelations of Standing Height of 275
Girls at Various Chronological Ages
(Decimal points omitted)

Age     7    8    9   10   11   12   13   14   15

 7         987  980  957  920  897  887  866  836
 8    987       989  969  934  914  904  882  850
 9    980  989       986  954  927  909  881  844
10    937  969  986       979  947  911  865  816
11    920  934  954  979       974  923  855  790
12    897  914  927  947  974       964  887  810
13    887  904  909  911  923  964       961  901
14    866  882  881  865  855  887  961       974
15    836  850  844  816  790  810  901  974
16    810  824  814  780  747  763  860  948  989




Table 6

Intercorrelations of Weight of 273
Girls at Various Chronological Ages
(Decimal points omitted)

Age     7    8    9   10   11   12   13   14   15

 7         890  880  835  810  793  755  773  744
 8    890       920  896  871  856  825  812  771
 9    880  920       932  906  882  840  818  773
10    835  896  932       958  936  892  842  777
11    810  871  906  950       967  921  866  790
12    793  856  882  936  967       954  892  816
13    755  825  840  892  921  954       944  830
14    773  812  818  842  866  892  944       953
15    744  771  773  777  790  816  880  953
16    732  759  756  735  762  775  639  916  965




Table 7

Stability as a Function of Type of Intervening Schooling

                               Initial S. D.      Final S. D.       Correlation
Type of Schooling        N    (Within Groups)   (Within Groups)   (Within Groups)

Compulsory Level       1518        8.44              7.00               .67
Vocational              946        8.29              6.92               .67
Lower Secondary         958        7.48              5.40               .56
Gymnasium              1194        7.35              5.08               .56

Table 8

Intercorrelations of Independently Computed Semester
Grade Averages for a Constant Range of Talent
(N is approximately 1600 for each correlation; decimal points omitted)

         II   III    IV     V    VI   VII  VIII

I       556   456   439   399   415   387   342
II            490   445   418   383   364   339
III                 562   496   456   445   354
IV                        512   469   442   416
V                               551   500   453
VI                                    544   482
VII                                         541

Table 9

Comparison of Predictive and Postdictive Validities
of College Aptitude Tests
(Decimal points omitted)

                    I    II   III    IV     V    VI   VII  VIII

Predictive
  ACT English     345   278   226   236   236   222   216   160
  ACT Math        279   189   171   171   145   162   156   121
Postdictive
  GRE Verbal      349   308   255   268   251   218   213   163
  GRE Quant.      348   333   311   291   275   205   170   146

Corrected to Freshman Range of Talent

Predictive
  ACT English      40    35    27    25    22    22    24    20
  ACT Math         40    30    25    23    20    20    18    15
Postdictive
  GRE Verbal       42    40    31    33    31    28    28    21
  GRE Quant.       43    43    38    37    34    26    22    19

Table 10

Comparison of GRE Aptitude and Advanced Test Validities

(Correlations are computed within groups defined
by Advanced Test and then aggregated; Aptitude
Test validities differ somewhat from those in
Table 9, which were computed within college and
sex. Decimal points omitted.)

Restricted Sample

                    I    II   III    IV     V    VI   VII  VIII

Verbal            297   283   262   281   275   256   223   195
Quantitative      270   246   209   233   217   215   203     6
Advanced          266   304   347   339   336   343   316   258

Corrected to Freshman Range of Talent

                    I    II   III    IV     V    VI   VII  VIII

Verbal             37    36    33    36    35    33    28    25
Quantitative       34    31    27    32    28    28    26    20
Advanced           36    38    43    43    42    43    40    33

THE PSYCHOLOGICAL TEST



The status of measurement in a discipline is intimately related to the status
of both research and theory in that discipline. Little sophisticated research on
electrical phenomena could be done until measuring devices such as voltmeters,
ammeters, and ohmmeters were developed. Research was necessary in order to develop
the measurement devices, with the first "measurements" being simply presence or
absence of the phenomenon, but the devices also led to better research and theory.
Furthermore, as measuring devices became more sensitive, the range of experiments
possible was extended. The ability to measure in microvolts may represent as
important a step as the one from presence or absence of voltage to the first
voltmeter.

Importance of the Test for Theory. It is not generally recognized, however,
that the type of measurement available in a discipline also affects research and
theory. The psychological test, for example, represents a type of measurement
device found infrequently if at all elsewhere in the sciences. It is essential to
understand the nature of tests if one is to understand experimental or
observational correlates of tests or the theory that is developed from those correlates.
A scholastic philosopher can define intelligence in the absence of measures of
intelligence, but a psychologist qua psychologist can not do so.

The preceding statement does not assume the necessity for an operational
definition of each term in a scientific theory. A direct measure is not required
for each theoretical construct, but there must always at some point be a return
to data. The data in a discipline, in turn, depend on the measures. A theory
that requires measures which do not exist and which can not be developed is not
testable, and theories which are not testable are not acceptable scientific
theories.

It is the thesis of this book that most theories of intelligence are not

psychological test theory in some detail to lay the groundwork for the thesis and
as a basis for developing a theory of intelligence that is congruent with the
experimental and observational correlates of measures of intelligence.

Example of an Ordinal Scale. The first characteristic to be discussed is that
the psychological test furnishes only an ordinal scale of measurement. Suppose
that an investigator wishes to measure the number of English words known by a
particular population of people, e. g., high school students. He could define a
population of English words by means of the unabridged dictionary and devise a
method of sampling words at random from that population. (Note that a well defined
population of test questions is not ordinarily available to the test constructor,
which creates a problem to be illustrated by later examples.) The investigator
obtains a list of 100 words by his sampling method, presents this list to a
random sample of the population of people in whom he is interested, and asks for
definitions.

The answers given must be scored, and to score in anything like an objective,
replicable fashion a scoring key must be developed. Will the key demand word for
word definitions more or less as they appear in the dictionary? Will the test
author accept instead approximate definitions, including reasonably close synonyms?
Or will he be satisfied if the word is used in a phrase or clause in a fashion that
conveys generally the meaning of the word, indicating at a minimum that the subject
has seen the word used somewhere?

Clearly the number of words that are counted as correct will depend on the key,
which in turn will determine the estimates of the total size of the vocabularies
of the subjects. The latter computation is made simply by multiplying the number
correct on the test by the ratio of number of words in the dictionary to the 100
sampled by the test, but the figure obtained is relatively meaningless. Depending
on the nature of the test key, an individual's estimated vocabulary can vary
tremendously.
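
The dependence of the estimate on the scoring key can be made concrete with a small computation. The dictionary size and the three numbers correct below are hypothetical values chosen only to show the arithmetic; the function name is likewise an assumption.

```python
def estimated_vocabulary(number_correct, words_sampled=100, dictionary_size=450_000):
    """Scale the number correct by the ratio of dictionary size to words sampled.

    dictionary_size is an assumed figure for an unabridged dictionary.
    """
    return number_correct * dictionary_size / words_sampled


# Hypothetical numbers correct for one subject under three scoring keys.
number_correct_by_key = {
    "word-for-word definition": 20,
    "approximate definition or synonym": 35,
    "acceptable use in a phrase": 50,
}

for key, correct in number_correct_by_key.items():
    print(f"{key}: about {estimated_vocabulary(correct):,.0f} words")
```

The same subject answering the same 100 words receives estimates that differ by a factor of two and a half, solely because of the key.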

Is anything gained by converting the test from the original open-ended, or
recall, version to an objective test format such as multiple choice? The objective
test will be easier to score and with care the scoring can be accomplished with
zero error, but subjectivity in the writing and interpreting of the key for the
recall version has been pushed back into the selection of the misleads. By
selecting misleads that capitalize upon fine nuances of meaning, the test can be
made very difficult, and the subjects may appear to have restricted vocabularies.
On the other hand, by selecting misleads that require only the grossest of
discriminations, the test can be made quite easy.

It would be easily possible by any method of test construction to obtain
three vocabulary tests with quite different distribution characteristics in the
same population of subjects. For example, one could obtain means and standard
deviations approximating those in the following table for each of three randomly
selected 100-item tests with just a little trial and error.

                       Test A    Test B    Test C

Mean                     25        50        75

Standard Deviation       20        25        20

It is also reasonable to assume that there will be no gaps in the
distributions of scores and that most of the possible range of scores will be represented
on each of the tests. Furthermore, in large samples of subjects, say 1000 or so,
the distributions, whatever their shape, will appear quite regular. One does not
need third and fourth moments of the three distributions, furthermore, to draw
inferences about the shapes of the three distributions. Tests A and C are skewed,
though in opposite directions, while Test B, though probably symmetrical in
distribution, is more platykurtic than the normal distribution. These inferences
all follow from the relationships of the standard deviations to the means.
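
The reasoning from means and standard deviations to shape can be sketched numerically. Scores on a 100-item test are bounded by 0 and 100, so the distance from each mean to each bound, in standard deviation units, limits how long each tail can be. The helper name is hypothetical; the means and standard deviations are those of the table above.

```python
from statistics import NormalDist


def bounds_in_sd_units(mean, sd, floor=0.0, ceiling=100.0):
    """Distance from the mean to the lowest and highest possible scores, in S.D. units."""
    return (floor - mean) / sd, (ceiling - mean) / sd


tests = {"A": (25, 20), "B": (50, 25), "C": (75, 20)}

for name, (mean, sd) in tests.items():
    low, high = bounds_in_sd_units(mean, sd)
    # Share of a normal curve that would have to lie beyond each bound.
    lost_low = NormalDist().cdf(low)
    lost_high = 1.0 - NormalDist().cdf(high)
    print(f"Test {name}: floor at {low:+.2f} S.D., ceiling at {high:+.2f} S.D., "
          f"normal curve cut off: {lost_low:.3f} below, {lost_high:.3f} above")
```

For Test A the floor lies only 1.25 standard deviations below the mean while the ceiling is 3.75 above, so the long tail must run upward; Test C is the mirror image. For Test B both bounds lie exactly 2 standard deviations from the mean, which leaves no room for the tails of a normal curve.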

Characteristics of Ordinal Scales. The linear intercorrelations of the three
tests may be quite high (even a moderately well constructed test of 100 items for a
high school population should be quite reliable), but these correlations can not be
as high as the test reliabilities would allow. Some high scoring people on Test A,
each with different total scores, will have identical scores on Test B and even more
of them will have identical scores on Test C. A similar finding for low scoring
people on Test C will be noted in comparing their scores on Tests B and A. Such
cases arise from the differential skews of the three distributions, and regressions
are necessarily curvilinear. In general the number of units by which any two scores
differ in one distribution will be different from the number of units by which
scores comparable in rank order differ in either of the other distributions.

It is also clear that ratios of scores computed for any one of the three tests
are uninterpretable. The number 50 is twice 25, but getting 50 items right vs.
getting 25 items right does not have the same meaning for each of the tests. Zero,
furthermore, is quite a common score on Test A, is much less frequent on Test B,
and is very rare if it occurs at all on Test C. A score of zero on any test would
not indicate that the subject had a vocabulary of zero length. The accidents of
sampling from the population of words are involved, but more importantly the
arbitrary decisions of the test constructor serve to make a score of zero meaningless
with respect to absolute size of vocabulary. The only information about a score of
zero furnished by the test is that it is smaller than one.

The preceding characteristics of the three test scores define ordinal scales
of measurement. The information furnished is basically rank-order information
even though the numbers used have the appearance of an interval scale. The rank
orders of subjects inferred from the numbers are not identical for the three tests,
even allowing for measurement error, because the differences in skew will produce
tied ranks on one test that are not tied on another. If we utilize the rank-order
information, however, and convert the obtained raw scores to standard scores by
means of a monotonic nonlinear transformation, i. e., by working through the
percentile ranks, we can increase the linear correlations among the three tests as
compared to linear raw score correlations.
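
The gain from working through percentile ranks can be sketched with two artificial tests that order the same subjects identically but are skewed in opposite directions. Everything in the sketch is an illustrative assumption: the 200 evenly spaced "ability" values, the cubic score functions standing in for a hard Test A and an easy Test C, and the absence of measurement error.

```python
from statistics import NormalDist, mean


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def normalize(scores):
    """Convert raw scores to normal deviates through their percentile ranks."""
    n = len(scores)
    normalized = []
    for s in scores:
        below = sum(1 for v in scores if v < s)
        equal = sum(1 for v in scores if v == s)
        percentile_rank = (below + 0.5 * equal) / n   # ties share a midrank
        normalized.append(NormalDist().inv_cdf(percentile_rank))
    return normalized


# Hypothetical subjects, evenly spread on an underlying rank order.
ability = [(i + 0.5) / 200 for i in range(200)]
test_a = [100 * t ** 3 for t in ability]              # hard test, mean near 25
test_c = [100 * (1 - (1 - t) ** 3) for t in ability]  # easy test, mean near 75

raw_r = pearson(test_a, test_c)
normalized_r = pearson(normalize(test_a), normalize(test_c))
print(f"raw score correlation: {raw_r:.2f}")
print(f"normalized score correlation: {normalized_r:.2f}")
```

Although the two tests rank every subject in exactly the same order, the raw score correlation falls well short of unity because the regression is curvilinear; after normalization the correlation reflects the rank-order agreement. With real tests, measurement error and tied ranks would keep the normalized correlation below unity as well.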

Other examples will be presented to develop the argument in more detail and in
more generality, but for the present the vocabulary test example can be taken on
faith to represent the general case. Distributions of test scores are arbitrary.
Tests furnish rank-order information only. Since a normal distribution has a
number of desirable statistical properties, it is recommended that raw scores on
tests be converted to normal distributions by means of the nonlinear transformation
involving percentile ranks in a random sample of some defined population. When
this has been done, the scale of measurement is said to have been normalized, but
it is still ordinal. It has not been converted to an equal interval scale simply
by means of the transformation. The choice of the normal curve conversion is a
matter of convenience, not of scientific necessity or conformity to natural law.
If convenience dictates a different type of distribution, e.g., quartiles,
deciles, or centiles for the converted scale, a different type should be used.
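The conversion recommended above can be sketched in a few lines. This is a minimal illustration, not a procedure given in the text: the function name, the use of mid-percentile ranks for tied scores, and the sample data are all assumptions made for the example.

```python
from statistics import NormalDist

def normalized_scores(raw_scores):
    """Map raw scores to normal deviates by way of percentile ranks.

    Assumption for this sketch: tied scores share the midpoint of their
    block of ranks, which keeps every percentile rank strictly inside (0, 1).
    """
    n = len(raw_scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    result = []
    for score in raw_scores:
        below = sum(1 for s in raw_scores if s < score)
        tied = sum(1 for s in raw_scores if s == score)
        percentile_rank = (below + 0.5 * tied) / n
        # Monotonic, nonlinear step: percentile rank -> normal deviate.
        result.append(std_normal.inv_cdf(percentile_rank))
    return result

# Hypothetical, strongly skewed raw scores (many low values, one very high).
raw = [0, 0, 1, 3, 7, 20, 50]
z = normalized_scores(raw)
```

Only the rank order of `raw` survives in `z`: the gap between 20 and 50 carries no more weight than the gap between 1 and 3, which is the sense in which the normalized scale is still ordinal.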

The fact that measurement with a test is ordinal is only the most obvious
characteristic of this form of measurement. It is far from being the most
important. It will become clear later that ordinal measurement has little effect on
reliability or validity. Most of the inferences from test scores that are barred
by the lack of interval or ratio scales are relatively unimportant and substitutes
are generally available.

Possible Functions of Multiple Items. In the example of the preceding section
tests of 100 items were assumed. One hundred different words were selected at
random from an unabridged dictionary and subjects were asked to define these or to
select the correct alternative from a list of misleads. This suggests a property
of the test that is of the utmost importance.

Tests are typically composed of multiple items or "hurdles," with the subject
behaving in some fashion with reference to each item. In measuring ability the
performance required is either right or wrong, but in measuring personality or
interest answers are frequently yes or no, like or dislike, etc. The total score
on the test is also typically a linear combination of the scores on the items and,
in many tests, weights for each item are either zero or one. It is not an
essential feature of the test that the scoring be dichotomous, although dichotomous
scoring is found very frequently. It is also not essential that the combination of
items be linear, but nonlinear combinations can be dismissed from this discussion
on grounds that such combinations are used only infrequently for research purposes
and rarely if at all in standardized tests. The theory to be developed will
assume a linear combination of dichotomous items for the sake of convenience and
for widespread applicability, but the theory is directly applicable to all other
types of item scoring with only minor modifications. Major modifications would
be necessary, however, to adapt it to nonlinear combinations.

For the person steeped in traditional measurement theory the first hypothesis
concerning multiple items is that they are required for purposes of reliability.
All measurement is subject to some degree of measurement error. A scientist or
engineer frequently makes multiple independent readings of his measures and uses
the mean of these as his best estimate of the "true" value. Does the test differ
in any respect from the need to take multiple measurements to reduce error?

There is, indeed, a difference in the practices of the engineer and of the
psychologist. The person using a psychological test, when he wishes to increase
his precision of measurement, repeats the whole test or uses parallel forms of the
original test to obtain his sample of measures for which the mean is the best
estimate of the subject's "true" score. The total test score is considered the
measure, not the score on an item. It is true that increasing the number of items
in the test generally has the effect of increasing the reliability of the total
score, but increasing reliability is not the primary function served by use of
multiple items.

A second hypothesis concerning the function of multiple items is that they
are required to furnish a scale more nearly approximating a continuous scale of
measurement than does the dichotomous item. There are occasions, however, when
it is not merely desirable but necessary to add together multiple items each of
which is measured on a continuous, equal interval scale, i.e., a total score
composed of multiple items may be required even though the items are not dichotomous.
Again, multiple dichotomous items do furnish a scale approximating a continuous
measure, but this is secondary to their primary function.

The Important Function Served by Multiple Items. The principal function
served by multiple items is best seen as a contrast between test theory per se and
traditional measurement theory. In the ordinary measurement of height it is
reasonable to assume that each measurement operation for each person measured includes
a true score component and a random error component. This is the starting point
for classical measurement theory. The variance of obtained scores includes the
variance of true scores and the variance of error. Correlations involving the
obtained scores are a function of the covariances with true scores and the
variances of obtained scores. From this basis such statistical concepts as the
standard error of measurement and correction of correlations for attenuation by
measurement error are readily developed.

Classical measurement theory is not, however, readily applicable to the test.
Efforts to use classical theory over the years, furthermore, may have hurt test
development as much as they have helped. The major departure from classical theory
arises from the necessity to start with a definition of item score that differs
significantly from classical theory. An item score is composed, as Loevinger has
discussed (1954) most fully, of three distinct parts: score on the trait or
disposition (d) in which the test constructer is interested, systematic nonerror
noise or bias elements or factors (b), and random error (e). The important effect
of using multiple test items is to minimize the effects of the numerous nonrandom
factors that are subsumed under the label of noise.

To return to the vocabulary example, a high school student may have
encountered a word in his recent reading for which he obtained a definition, and
when this word was encountered on the test he answered it correctly. The word
may be difficult in general and the student's general vocabulary level low, but
he obtained an extra point in his total score for a nonrandom reason independent
of his general level of vocabulary competence. There are many such examples.
Some words are encountered more frequently in science than in the humanities, in
pulp magazines than in school books, or in certain neighborhoods or social levels
than in others, and so on. By taking a large sample of words such effects,
although still present, can be balanced off against each other, and the more general
disposition to know the meanings of words will be measured with greater validity.

In this connection it is instructive to look at item intercorrelations for
some standard ability tests. In tests that are quite homogeneous both with
respect to difficulty level of the items and the subject matter of the test, item
intercorrelations with a mean as high as .20 are not common in the full range of
talent and occur quite rarely in special groups who are restricted in range of
talent. While it is not easy to assess the random error component of variance in
an item (memory for item content makes suspect a repeated measures design and
parallel items are not easy to construct), it is probable that nonerror or bias
factors are a major contributor to item variance for most of the items that appear
in psychological tests.

The vocabulary example suggests an alternative designation of nonrandom noise
or bias in test items as item sampling error. This is indeed one source of bias,
but the equivalence breaks down in two important ways. In many, many cases there
is no defined population of items from which to sample, though the use of "item
selection error" could get around this difficulty. A more important difficulty,
however, is that certain bias factors are intrinsic to psychological items. Every
test item has a particular item format, a time limit or work limit, a set of
directions. In addition, each examinee has a different background of knowledge,
skills, sets, and other experiences. All contribute variance to a test item.
Reasoning is necessarily measured with verbal symbols, numerical symbols, or
figural materials. Words that appear in a vocabulary test occur with differential
frequency in different kinds of reading material. The use of "noise" or "bias"
suggests something unwanted or even uncontrollable, which is desirable, and the use
of "systematic" indicates that the behavior measured is lawful, which is also
desirable.

Just as a weed is any plant growing where it is not wanted, systematic noise
or bias includes any factor or element appearing where it is not wanted. One man's
noise, for one purpose, becomes another man's primary mental ability, for another
purpose. But unlike weeds, a great deal of systematic bias is impossible to
eradicate.

The Correlation Between Test and Criterion. A little algebra may be helpful
at this point. Let there be n items in the test and let d, b, and e represent the
disposition the test constructor desires to measure, the bias factors, and random
error, respectively. The correlation of the test of the disposition with a
criterion measure of the disposition is given by the following:

$$r_{xy} = r_{(x_1 + \dots + x_n)\,y}, \quad \text{and each } x_i = d_i + b_i + e_i$$
If error is truly random, then all covariance terms involving e will drop
out. Psychometricians have typically been willing to make this assumption.
Moreover, if the b terms were specific to each item and independent of the
disposition score on the item, they would have the functional characteristics of
error and covariance terms involving b would also drop out. While it is not
difficult conceptually to assume orthogonality of disposition and bias factors,
the assumption of specificity of bias to each item is almost always false.

It is also reasonable to assume that the noise factors are unrelated to the
criterion measure. (This can be considered true by definition.) With these
considerations in mind, formula 1 can be rewritten as follows:



In the best of cases bias factors are minor sources of variance of total
score on the test (denominator) and make zero contributions to covariance with
the outside variable (numerator) currently of interest. In the worst of cases
the nonrandom bias variance of the test is entirely noise from the point of view
of the aims of the test constructer, and the only nonzero terms with criteria
involve sources of variance other than the one the test is supposed to measure.

By basing the total score on many items, it is possible to build up the
validity of the test for a particular disposition even though any one item has
only a small component of that disposition. The secret is to spread item
selection over as many bias factors as possible so that any one bias factor runs
through a minimum number of items. The goal, though frequently unattainable, is
to make the bias factors specific to items. Even when it is impossible to keep
the bias covariance terms near zero in the denominator, the scattering of this
variance among many bias factors will avoid the situation in which the total score
is a better measure of some other disposition than the one intended. Many items,
therefore, are a necessary though not a sufficient condition for building up the
variance of a particular disposition in the total score on the test. A basic
misconception concerning the original choice of items will result in the test
constructer measuring something, with greater and greater precision as he
continues to add items, that he does not wish to measure.
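The effect of spreading bias factors across items can be illustrated numerically. The sketch below is not the authors' formula: it assumes, purely for illustration, that every item carries the same small weight on the disposition, that the criterion is the disposition itself, and that d, the bias factors, and the errors are mutually uncorrelated. The function name and all weights are hypothetical.

```python
from math import sqrt

def validity(group_sizes, d_weight=0.3, b_weight=0.6, e_sd=0.7):
    """Correlation of the total score with the disposition d.

    Assumed item model: x_i = d_weight*d + b_weight*b_g + e_i, where d and
    each bias factor b_g have unit variance, e_i has standard deviation
    e_sd, and all of them are mutually uncorrelated.
    group_sizes[g] is the number of items that share bias factor g.
    """
    n = sum(group_sizes)
    var_from_d = (n * d_weight) ** 2
    # A bias factor shared by m items contributes (m * b_weight)^2.
    var_from_b = b_weight ** 2 * sum(m * m for m in group_sizes)
    var_from_e = n * e_sd ** 2
    total_sd = sqrt(var_from_d + var_from_b + var_from_e)
    return n * d_weight / total_sd  # cov(total, d) / sd(total), with sd(d) = 1

one_item = validity([1])          # a single item
shared_bias = validity([40])      # one bias factor runs through all 40 items
spread_bias = validity([1] * 40)  # every bias factor is specific to one item
```

Under these assumed weights a single item correlates about .31 with the disposition; forty items sharing one bias factor reach only about .44, whereas forty items with item-specific bias reach about .90. Adding items helps in both cases, but the large gain comes from scattering the bias variance.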
INTERRELATIONSHIPS OF HOMOGENEITY, RELIABILITY, AND VALIDITY



The concept of homogeneity of a test does not appear in classical measurement
theory. Homogeneity with respect to content is an issue only in those situations
in which multiple items are used. The statement made earlier that multiple items
did not serve the same primary purpose as multiple measures in physics or
engineering, but did have a secondary effect on reliability, is important in this
connection. Many psychologists are confused on this issue. Homogeneity interacts with
both reliability and validity, but must not be confused with either.

Homogeneity and Reliability. Kuder-Richardson homogeneity coefficients are
frequently called reliability coefficients. Under certain restricted
circumstances, it is true, one can obtain a reliability estimate from a measure of the
homogeneity of the test, but it is essential that the investigator keep the
distinction, and the conditions, clear in his own mind and in his writing.

The Kuder-Richardson formula best used to estimate reliability is the following:
Only the number of items, the difficulty levels of the items, and the variance of
the total score on the test are used. (This is algebraically equivalent to the
approach to homogeneity of a set of measures by means of the analysis of variance
which Hoyt (1940) suggested.) The variance of the total score is, in turn, a
function of the number of items, the item variances, and the item covariances.
These parameters do not have a one-to-one relationship to reliability defined as
the correlation between repeated measures or between repeated parallel measures.
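The printed formula does not survive in this copy. The description (number of items, item difficulties, total-score variance) matches the coefficient commonly known as Kuder-Richardson formula 20, and the sketch below assumes that this is the formula meant. The function name and the two small data sets are hypothetical.

```python
def kr20(item_matrix):
    """Kuder-Richardson formula 20 for a persons-by-items matrix of 0/1 scores."""
    n_persons = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_persons
    # Population variance (divide by N) so that it is on the same footing
    # as the p*q item variances below.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_persons  # item difficulty
        sum_pq += p * (1 - p)                                # item variance
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: items ordered so that passing a harder item implies
# passing every easier one (strong positive item covariances).
cumulative = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical data: two items with zero covariance.
uncorrelated = [[1, 1], [1, 0], [0, 1], [0, 0]]

h_cumulative = kr20(cumulative)
h_uncorrelated = kr20(uncorrelated)
```

The coefficient depends on the item covariances only through the total-score variance. When the covariances are zero, as in the second data set, the total variance equals the sum of the item variances and the coefficient is zero, whatever the retest reliability of the individual items may be.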

The difference, and the relationship, between homogeneity and reliability can
best be shown by writing out the formula for the repeated measures correlation as
a function of the relationships involving the items. (The prime refers to the
repeated or parallel item or test total score.)
If item intercorrelations are zero, the right hand term in the numerator
disappears as do the right hand terms under each of the radicals. The reliability
coefficient is then completely a function of the item reliabilities and can vary
from zero to one. Furthermore, one can conceive of a test in which these
conditions would be rather closely approached. A scored biographical data blank, for
example, could contain items that were essentially uncorrelated with each other,
but the reliability of answering an individual item would be very high.

As the intercorrelations of items within and between tests approach the
correlations between the paired items, the Kuder-Richardson homogeneity
coefficient approaches the reliability coefficient of the test. For the two to be
equal in conception the item difficulties would all have to be the same.
Otherwise covariances are necessarily somewhat attenuated. In practice, this latter
condition can be ignored since the formula does take out the variance due to the
main effect of difficulty level, and variations of difficulty level within the
normal range have only a slight biasing effect on the interaction between persons
and items which is the essential term determining the homogeneity of the test.

The interaction between reliability and homogeneity is more clearly seen if
Formula 5 is rewritten to make explicit the assumption that the retest or parallel
measure is identical with the first; i.e., test variances are equal and
intercorrelations within tests are equal to intercorrelations between tests.
The right hand quantities in the numerator and denominator are identical. The
left hand quantities represent the ratio between paired item covariances and
variances. With many items in a test the right hand quantities will generally be much
larger than the left hand ones; the homogeneity of the items, in other words,
typically makes a larger contribution to reliability than the reliability of the
paired items. The test constructer, by narrowing the focus of his test, i.e., by
redefining the disposition in which he is interested to make it coincide with an
important source of nonerror noise, can step up test reliability very easily. To
suggest that this may be undesirable may seem strange to those imbued with classical
measurement theory. Why should not the ratio of true score variance to total
variance be maximized? The answer is, of course, that an increase in reliability
is not worth the price if the disposition which the psychologist seeks to measure
is redefined in the process of test construction to make it less useful
psychologically.

As a matter of fact, the positive steps in test construction that follow from
the concept of disposition, nonrandom noise, and error contributions to item
variance make it difficult to achieve high reliability with a limited number of
items. The variance of noise or bias factors must be spread around as widely as
possible. The more successful the test constructer is in his efforts, the lower
will be the item covariances. He can compensate for this effect only by
increasing the number of items in the test.

Homogeneity and Validity. No one administers a test, however, simply to
obtain reliable information of some sort about a person. Tests are administered
in order to make inferences about behavior: inferences about jobs, school,
military assignments in applied work, or inferences about functional relationships
involving a particular disposition in more basic research. Validity is a
shorthand term used to cover the inferences that can be drawn from a score. Validity
coefficients stated in terms of item characteristics were presented in Formulas 1
and 2, but a simpler one will now be more convenient. This one is stated in terms
of the items, without regard to their components, and their relationship with any
outside variable, y.
Formula 7 shows that, item validities being equal, there is a premium placed
on low homogeneity. Item covariances occur only in the denominator. High
reliability which comes about through an interaction with homogeneity is indeed a
misleading goal. Only in case certain subsets of items in a heterogeneous test have
zero correlations with the criterion does it pay to increase the homogeneity of
the test and obtain the concomitant increase in reliability. When all items are
related to the outside variable, by keeping item intercorrelations low, the
variance of the test score will be kept low and reliability will be kept low, but
the size of the validity coefficient increased. Such reasoning is completely
compatible with expectations based upon multiple regression theory, but it does
require qualification of the classical theory concerning the relationship of
reliability to validity.
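The premium on low homogeneity can be checked with the standard expression for the correlation of an unweighted sum with an outside variable. The sketch assumes, for simplicity, standardized items that all have the same validity and the same intercorrelation; it is an illustration of the argument and not a reproduction of the authors' Formula 7, which is not legible in this copy. Function names and numerical values are hypothetical.

```python
from math import sqrt

def composite_validity(n_items, item_validity, mean_item_r):
    """Correlation of a sum of n standardized items with an outside variable.

    Assumes every item correlates item_validity with the outside variable
    and mean_item_r with every other item.
    """
    covariance_with_y = n_items * item_validity
    # Item covariances enter only here, in the standard deviation of the sum.
    sd_total = sqrt(n_items + n_items * (n_items - 1) * mean_item_r)
    return covariance_with_y / sd_total

def composite_homogeneity(n_items, mean_item_r):
    """Internal-consistency coefficient implied by the same assumptions."""
    return n_items * mean_item_r / (1 + (n_items - 1) * mean_item_r)

# Fifty items, each correlating .10 with the outside variable.
low_r_validity = composite_validity(50, 0.10, 0.05)
high_r_validity = composite_validity(50, 0.10, 0.20)
low_r_homogeneity = composite_homogeneity(50, 0.05)
high_r_homogeneity = composite_homogeneity(50, 0.20)
```

With item validities held at .10, raising the mean item intercorrelation from .05 to .20 raises the homogeneity coefficient but lowers the correlation of the total score with the outside variable.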

Reliability and Validity. Classical theory states the relationship of
reliability to validity in the correction for attenuation,

$$r_{x_\infty y_\infty} = \frac{r_{xy}}{\sqrt{r_{xx'}}\,\sqrt{r_{yy'}}}$$

This formula is applicable to test theory only in cases where reliability is
changed by the addition or, with appropriate changes in the formula, subtraction
of items of the same type as the originals. Whenever items are discarded
selectively with others retained, an increase in reliability may accompany a
decrease in the validity of the test. Increasing the reliability of a test by
doubling its length with exactly comparable items will increase the test's
validity. Increasing the reliability of a test by item selection procedures will not
have a predictable effect on the test's validity.
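The one case in which classical theory does apply, lengthening with exactly comparable items, can be worked through numerically. The sketch below combines the correction for attenuation with the Spearman-Brown formula for a lengthened test; the function names and the starting values (.40 validity, .60 reliability) are hypothetical, and the criterion is treated as fixed.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy=1.0):
    """Classical estimate of the correlation between true scores."""
    return r_xy / sqrt(r_xx * r_yy)

def lengthened_reliability(r_xx, k):
    """Spearman-Brown reliability of a test made k times as long."""
    return k * r_xx / (1 + (k - 1) * r_xx)

def lengthened_validity(r_xy, r_xx, k):
    """Validity after k-fold lengthening with exactly comparable items.

    Under classical assumptions the true-score correlation is unchanged,
    so validity scales with the square root of the reliability ratio.
    """
    return r_xy * sqrt(lengthened_reliability(r_xx, k) / r_xx)

old_validity, old_reliability = 0.40, 0.60
new_reliability = lengthened_reliability(old_reliability, 2)
new_validity = lengthened_validity(old_validity, old_reliability, 2)
```

Doubling the test raises reliability from .60 to .75 and validity from .40 to roughly .45, while the corrected correlation is the same before and after. No such bookkeeping is available when reliability is raised by discarding items selectively, which is the point of the paragraph above.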

The same assumption, i.e., adding exactly comparable items, must be made in
estimating the reliability of a test of a different length than the original, but
the importance of the assumption in this case is better known. It may be
instructive, however, to apply it to the hypothetical situation described earlier
in which item intercorrelations in the original test are zero.

When item intercorrelations are zero, test score reliability is more or less
the mean of the item reliabilities. When the length of the test is doubled by
the addition of exactly comparable items, item covariances are no longer zero.
The assumption of comparability means that each item in the original
test now has itself or a parallel version of itself in the test of increased
length. The new test no longer has zero homogeneity.
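This hypothetical case can be verified directly from the variances and covariances. The sketch assumes unit-variance items that are mutually uncorrelated except that every version of an item correlates at the item reliability with every other version of the same item; the function names and the value .60 are illustrative only.

```python
def parallel_forms_reliability(copies, item_reliability, n_items=10):
    """Correlation between two forms, each holding `copies` parallel
    versions of n_items otherwise uncorrelated, unit-variance items."""
    # Variance of one form: each item's versions covary with one another,
    # and with nothing else.
    var_form = n_items * (copies + copies * (copies - 1) * item_reliability)
    # Covariance between forms: every version of an item in one form pairs
    # with every version of the same item in the other form.
    cov_forms = n_items * copies * copies * item_reliability
    return cov_forms / var_form

def spearman_brown(reliability, length_factor):
    """Classical prediction for a test lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

original = parallel_forms_reliability(1, 0.60)  # zero item intercorrelations
doubled = parallel_forms_reliability(2, 0.60)   # each item now has a parallel
predicted = spearman_brown(original, 2)
```

The original test has a reliability equal to the item reliability, .60, with zero homogeneity. The doubled test agrees with the Spearman-Brown prediction of .75 only because the added items introduce nonzero covariances between each item and its parallel version.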

Minimum Requirement for Homogeneity. Even though completely uncorrelated
items would be best to maximize the correlation between a test and an outside
variable, such a set would not be considered to measure a psychological
disposition. It is here that the concept of the homogeneity of the test is required.
Various indices of a disposition of psychological interest just ought to have
something in common. If the disposition of glass to shatter or beams to snap
under stress were measured in a fashion analogous to the psychological test,
the various items would be correlated just as the items that measure height or
weight in the other physical analogues to the test are correlated.

If the reality, or even the necessity, for some degree of homogeneity is
accepted, it does not necessarily follow that homogeneity should be as high as
possible within the limitations of obtaining a measuring device furnishing a
nearly continuous score. (If a test of height were given very reliably, maximum
homogeneity would result in a U-shaped distribution. Some degree of spacing of
item difficulties, with the resultant decrease in item covariances, is necessary
in a test for height to discriminate among the examinees.) Some degree of
homogeneity is expected, but the degree is optional. The degree depends both upon
the psychological facts, i.e., the extent to which behavior is dispositional as
opposed to situational, and upon the breadth of the disposition that the
psychologist wishes to measure. He may be interested in measuring intelligence, or
perhaps something even broader than intelligence, or at the other pole in measuring
the fluency with which four letter words beginning with s can be evoked.

Let the rule be that any set of positively intercorrelated items can be added
together. Such composites produce a psychologically meaningful total score,
particularly if the items are intercorrelated at about the same level. It does not
matter whether this level is low or high. If one is trying to measure some
disposition that is very broad, and in consequence each item may contain only a very
small portion of the variance of the disposition, it will be necessary to plan on
using many items widely scattered in order to dissipate the many possible sources
of nonrandom noise. That other factors will contribute to the total score is of
no consequence. As a matter of fact, the larger the number of these the better,
since this will tend to keep the contribution of each small. The restriction
that item intercorrelations be at about the same level is necessary in order to
avoid giving undue weight to a particular bias factor. It can, however, be
relaxed if this is carefully done. The bias factors must themselves be evenly
distributed.

The Goal of High Homogeneity. In contrast to the preceding rule, those who
set high homogeneity of the test as their goal imply that only those items can be
added together that have the very highest level of intercorrelations. If any
given test can be broken down into subsets of items whose intercorrelations are a
little higher than the cross correlations between subsets, the original test is
"impure" or heterogeneous and new tests should be constructed as defined by the
subclusters of items.

The reader who believes the rule that any set of positively intercorrelated
items can be added together is ambiguous (after all, there are many possible levels
of item intercorrelations and thus many possible tests) should ponder the
ambiguity in the high homogeneity rule. How high is the highest possible level of
intercorrelations? When does the correlation between two different items become
sufficiently high that it should be considered the correlation between parallel
forms of the same item? If this rule is pushed to the extreme, does it not mean
that the ultimate in homogeneity is reached when one reaches a small set of items
that are essentially parallel forms of each other?

In discussing Binet's interest in multiple intellectual functions and the
development of the Binet scales of intelligence, Guilford (1967) concluded that
Binet's decision to use a single score for the totality of his items (mental age)
was completely incongruous. In the light of the present discussion Guilford's
conclusion is simply incorrect. One can accept multiple factors both of the
Thurstone sort and of the Guilford sort, which appear to be narrower than the
Thurstone primary mental abilities, along with a general factor without any
logical or psychological difficulty. (See Humphreys, 1962, for a fuller discussion
of this issue.) Ability items are positively intercorrelated to varying degrees.
High intercorrelations determine narrow factors; moderate intercorrelations
determine somewhat broader factors; low intercorrelations determine a general
factor. It is completely reasonable for an investigator to measure with a single
test the factor or complex of factors that produce the lowest positive correlations
among broadly distributed items.

It is important to realize that the argument here is empirical, not logical.
Good characteristics as defined socially of the human being tend to be positively
correlated. The most disparate abilities, with abilities used in the most general
sense, such as correlations between clerical and mechanical abilities, or between
information about farming and about social sciences, are positively correlated.
Psychological dispositions of the abilities sort are also positively correlated
with physical measures such as height and weight. Terman's gifted children were
healthier, wore fewer glasses, etc., on the average, than other children.

This tendency for "all good things to go together" is much more marked when
one samples from men or women in general in a given cultural group than when
samples are drawn from more restricted ranges of talent. Even within samples of
college students enrolled in the most highly selective institutions correlations
still tend to be positive, though occasionally negative values occur which can
generally be explained in terms of sampling errors. Negative correlations in a
restricted population do occur, but there is frequently a sampling explanation.
Highly selective universities cannot play big-time football without having
separate standards for athletes and nonathletes. Correlations between athletic
abilities and intellectual abilities will be negative in such mixed groups.

Arguments Pro and Con. One argument advanced against the broad test is that
"purer" tests are better than more complex tests on grounds basically of scientific
esthetics. Here there is a difference in point of view as to what constitutes a
pure test. Guilford's factor pure tests are seen by the present writer as
inextricably complex. Tests of high homogeneity that measure one of the "aptitudes"
in Guilford's structure of intelligence reflect simultaneously variance introduced
by all of his three dimensions. Such tests are like the physical analogue in which
weight was measured by having each subject lie down in a uniform manner at the end
of the lever: scores were highly homogeneous, but reflected both height and
weight in an unknown combination.

A second argument against broad tests, and an almost convincing one, is that
all of the information in the most complex test is basically available in a large
number of highly homogeneous tests of the Guilford sort. Potentially also,
information is lost by moving from many tests to a smaller number of broader tests.
There are two different counter arguments to this point, both of which are matters
of feasibility. It would be very difficult to motivate examinees to sit through
and work well for the amount of time necessary to administer 120 tests each
sufficiently reliable to justify a separate score. It is also very difficult
statistically to obtain stable weights for 120 measures for the various sorts of
inferences in which psychologists are interested. It is not an exaggeration to
estimate that the number of cases required would run into the tens of thousands
for each outside variable considered. This estimate has a statistical basis in
the formula for the standard error of a beta weight and an empirical basis in the
ubiquity of positive intercorrelations among items and tests.

It is even possible that, given optimum Ns for weighting purposes, little if any
information would be lost by the use of broad tests carefully constructed
(Humphreys, 1962). Tests of the analogue to the main effects in an analysis of
variance for each of Guilford's dimensions might well furnish the same information
as the 120 tests representing the Cartesian product of his dimensions. The 120
different combinations of those dimensions become a source for selection of items
and a guide for distributing noise factors as widely as possible in this
conception, but not a mandate to construct 120 tests for assessment purposes.
ILLUSTRATIONS OF TEST CHARACTERISTICS BY MEANS OF PHYSICAL ANALOGUES



It is possible to develop physical analogues to the test that help to
clarify the principles that have been presented. These principles will also be
developed more fully as the various physical analogues are discussed.

Behavioral Test of Height. A carpenter is asked to make a series of
standards in the form of an inverted L with the only specifications being that the
uprights will all differ from each other and that they will cover the range in
height of adult men. It is not essential that the horizontal bar be at right
angles with the upright, and the essential specifications are checked perceptually
only. Each standard is given a separate designation, perhaps a number. A sample
of men is drawn from a population; each man is confronted with each of the
standards in turn in a uniform manner; and each man is given a score representing
merely the number of times the tip of the horizontal bar touched his head when an
attempt is made to pass it over his head with the upright being placed vertically
on the ground.

If a very large number of standards is constructed initially, it should be
possible to select from the larger group a smaller set having any specified
distribution of item difficulties. (This statistic is readily computed: the number
of heads hit by a standard divided by the total number of men in the sample
measured provides a statistic varying from zero to 1.00 with high values
representing "easy" items.) If 9 standards are selected having difficulty levels
ranging from .10 to .90 by steps of .10, the distribution of total scores for
height will be rectangular in shape; i.e., symmetrical but highly platykurtic.

It is not difficult to see how the shape of the distribution of total scores
can be inferred from the distribution of item difficulties. The easiest item has
a difficulty index of .9, i.e., 10% of the sample fails the easiest item. Their
scores are zero. Another 10% fail the item having an index of .8. While 20% fail
this item, one-half of the failing group passed the easier item. Thus 10% will
have a score of 1. In a similar fashion the 10% who passed the most difficult
item passed all easier items. Since this is the 9th standard in order of
difficulty, only 10% of the sample will have a score of 9. With 10% of the sample
at each score, the distribution is rectangular.

It is instructive to make a table in which the items are placed in a horizontal
array in order of increasing difficulty and the subjects are placed in the
vertical array in order of increasing size of total score. (In this example only,
all subjects having the same score can be represented by a single tally.) The
result, in Table 1, is a triangular matrix of tallies which defines what has become
known as a perfect Guttman scale. No man fails an easier item after having passed
a more difficult one. When the number of tallies in a column is counted and
divided by the number of people, the difficulty level of the item is the result.
When the number of tallies in a row is counted, the total score on the test is the
result. (A percentage score on the test is sometimes obtained by dividing the
total score by the number of items in the test, but the number of items and the
zero point are quite arbitrary.)
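
The tabulation just described can be sketched in a few lines of Python. The
function names and the use of one row per score group are assumptions made for
illustration only; nothing here is taken from the original report beyond the
nine difficulty levels and the counting rules stated above.

```python
def guttman_matrix(n_items):
    """Rows are score groups (highest score first); a 1 marks a passed item.

    Items are ordered from easiest (column 0) to most difficult.
    """
    return [[1 if item < score else 0 for item in range(n_items)]
            for score in range(n_items, -1, -1)]


def has_reversal(row):
    """True if an easier item is failed although a harder one is passed."""
    return any(row[i] == 0 and row[j] == 1
               for i in range(len(row)) for j in range(i + 1, len(row)))


matrix = guttman_matrix(9)
n_people = len(matrix)

# Column count divided by number of people gives the item difficulty;
# the row count gives the total score.
difficulties = [sum(row[item] for row in matrix) / n_people
                for item in range(9)]
total_scores = [sum(row) for row in matrix]
any_reversal = any(has_reversal(row) for row in matrix)

print(difficulties)    # easiest item first
print(total_scores)
print(any_reversal)
```

Each row stands for all subjects sharing a score, as the parenthetical remark
above allows for this example only.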

It is of interest that the product-moment intercorrelations of the items in
a perfect Guttman scale form what Guttman has called a simplex matrix (Guttman,
1955). The simplex matrix indicates the presence of a single underlying function
or factor when the successive variables differ in difficulty level, complexity,
growth, or level of learning (Humphreys, 1960). The intercorrelations of the
present example are presented in Table 2 for purposes of illustration. It should
also be noted in connection with this example that the presence of a single factor
is inferred from the form of the correlational matrix and the known differences
in difficulty level of the items. The simple, one factor explanation cannot be
obtained by the application of the usual factor analytic methods. If squared
multiple correlations are alternated with unities in the diagonal, the 7 items
will define four principal component factors (Humphreys, 1960).
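
The intercorrelations of a perfect Guttman scale follow directly from the item
difficulties, and a short sketch makes the pattern visible. It rests on one
fact stated above (no man fails an easier item after passing a harder one), so
the proportion passing both items equals the difficulty index of the harder
item. The function name `guttman_phi` is hypothetical, and the multiplicative
check at the end is one common characterization of a simplex rather than a
procedure described in the report.

```python
from math import sqrt, isclose


def guttman_phi(p_easy, p_hard):
    """Product-moment (phi) correlation of two items on a perfect Guttman scale.

    Everyone who passes the harder item also passes the easier one, so the
    proportion passing both equals p_hard.
    """
    q_easy, q_hard = 1 - p_easy, 1 - p_hard
    covariance = p_hard - p_easy * p_hard
    return covariance / sqrt(p_easy * q_easy * p_hard * q_hard)


difficulties = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
n = len(difficulties)
r = [[guttman_phi(difficulties[i], difficulties[j]) if i < j else None
      for j in range(n)] for i in range(n)]

# Print the upper triangle, one row per item.
for i in range(n - 1):
    print(i + 1, " ".join(f"{r[i][j]:.2f}" for j in range(i + 1, n)))

# Simplex pattern: for items i < j < k, r[i][k] equals r[i][j] * r[j][k].
simplex_holds = all(isclose(r[i][k], r[i][j] * r[j][k], rel_tol=1e-9)
                    for i in range(n) for j in range(i + 1, n)
                    for k in range(j + 1, n))
print(simplex_holds)
```

Correlations are largest next to the diagonal and fall away steadily with
increasing separation in difficulty, which is the form the text attributes to a
single underlying function.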

What would the test constructer do if he wished to obtain a normal distribution
of test scores for his measure of height? He would go back to his population
of standards and select those having an appropriate distribution of item
difficulties. Difficulties ranging from .96 through .89, .77, .60, .40, .23, .11,
to .04 would produce a distribution of total scores that would be approximately
normal. The mean would be the same as the mean of the rectangular distribution,
but the standard deviation would be smaller.
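
The inference from item difficulties to score distribution can be sketched as
follows, assuming a perfect Guttman scale. The function names are hypothetical.
The second call uses the eight difficulties exactly as printed above, so its
scores run from 0 through 8.

```python
def guttman_score_distribution(difficulties):
    """Proportion of people at each total score on a perfect Guttman scale.

    On such a scale a person scores at least k exactly when he passes the
    k-th easiest item, so P(score >= k) is that item's difficulty index.
    """
    p = sorted(difficulties, reverse=True)          # easiest item first
    at_least = [1.0] + p + [0.0]                    # P(score >= k)
    return [round(at_least[k] - at_least[k + 1], 10)
            for k in range(len(p) + 1)]


def mean_and_sd(dist):
    """Mean and standard deviation of a score distribution given as weights."""
    mean = sum(k * w for k, w in enumerate(dist))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(dist))
    return mean, var ** 0.5


rectangular = guttman_score_distribution([0.1 * k for k in range(1, 10)])
bell_shaped = guttman_score_distribution(
    [0.96, 0.89, 0.77, 0.60, 0.40, 0.23, 0.11, 0.04])

print(rectangular)
print(bell_shaped)
print(mean_and_sd(rectangular), mean_and_sd(bell_shaped))
```

The second set piles scores up in the middle of the range and yields the
smaller standard deviation, in line with the statement above.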

The test constructer by appropriate selection of item difficulties can produce
a distribution having any shape he desires. Wide variations in both kurtosis and
skewness are possible. A test for the selection of basketball players can be
produced having a tail at the upper end of the score distribution. After all, the
coach is not concerned about making discriminations among college freshmen who are
in the lower quarter, or even half, of the distribution of height. U-shaped
distributions are possible though hardly useful. For a general purpose test,
however, the test constructer does not worry very much about the distribution of
item difficulties needed to produce a raw score distribution having a particular
shape. Instead he takes what the accidents of item selection produce and converts
the raw scores, by means of a nonlinear, monotonic transformation, into a
distribution of converted scores. As indicated earlier, it is frequently
convenient that the shape of the converted score distribution be normal.
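
One familiar way to carry out such a conversion is to replace each raw score by
the normal deviate that corresponds to its mid-percentile rank. The report does
not say which transformation was intended, so the method, the function name,
and the skewed raw scores below are all assumptions made for illustration.

```python
from collections import Counter
from statistics import NormalDist


def normalized_scores(raw_scores):
    """Map each raw score to the normal deviate of its mid-percentile rank."""
    n = len(raw_scores)
    counts = Counter(raw_scores)
    table, below = {}, 0
    for score in sorted(counts):
        # Half of the people at this score are counted as falling below it,
        # which keeps the rank strictly between 0 and 1.
        mid_rank = (below + counts[score] / 2) / n
        table[score] = NormalDist().inv_cdf(mid_rank)
        below += counts[score]
    return table


# Hypothetical, positively skewed raw scores.
raw = [0] * 30 + [1] * 25 + [2] * 20 + [3] * 12 + [4] * 8 + [5] * 5
conversion = normalized_scores(raw)
for score, z in conversion.items():
    print(score, round(z, 2))
```

The conversion is monotonic, so the rank order of subjects is preserved, but
the spacing between adjacent raw scores is no longer uniform.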

A very important reason why the test constructer does not worry too much
about the selection of item difficulties in most cases is that psychological items

ne. One reason why this is true is the unavoidable presence of measurement error
in each item. While the presence of the Guttman scale made possible the
characterization of the shape of the score distribution from knowledge of the
distribution of item difficulties alone, it was also necessary to assume error
free measurement in order to obtain that scale. The measurement situation had to
be highly standardized. Instructions to subjects were given to control posture;
instructions to the test administrator controlled the nature of the surface on
which the subject stood and the placement of the standards relative to the
subject.

The Introduction of Measurement Error. This principle can be illustrated by
returning to the test of height and the population of standards originally
postulated. The only change to be introduced is that uniform conditions of
measurement will not be specified. Posture will no longer be controlled. Neither
will the measurement surface nor the placement of the standard be controlled. All
of these will be allowed to vary at random from subject to subject and from item
to item within a subject.

When nine items are now selected having the same equally spaced distribution
of item difficulties as before, the distribution of total scores will no longer
be rectangular, but instead will be unimodal. When subjects and items are tabled
as before, many persons will be found who have failed an easier item after having
passed a more difficult one. Something like the item data in Table 3 will be the
result, though Table 3 is schematic only if it is taken to represent more than 10
subjects. The number of subjects who fail easier items after passing more
difficult ones will be a function of the amount of error that has been introduced
by the failure to standardize the measurement situation.
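
A small simulation can illustrate the contrast between careful and careless
administration. It is only a sketch under stated assumptions: true height and
error are treated as standard normal deviates, the error is drawn afresh for
every item, and the cutoffs are rescaled so that the nine difficulty indices
stay at .9 through .1 in both conditions. None of these modeling choices, nor
the names used, come from the report.

```python
import random
from statistics import NormalDist, pvariance

DIFFICULTIES = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]


def simulate(error_sd, n_people=5000, seed=1):
    """Return one row of 0/1 item scores per simulated man."""
    rng = random.Random(seed)
    spread = (1 + error_sd ** 2) ** 0.5
    # Rescaled cutoffs keep each item's expected difficulty index unchanged.
    cutoffs = [spread * NormalDist().inv_cdf(1 - p) for p in DIFFICULTIES]
    rows = []
    for _ in range(n_people):
        height = rng.gauss(0, 1)
        rows.append([1 if height + rng.gauss(0, error_sd) >= c else 0
                     for c in cutoffs])
    return rows


def summarize(rows):
    """Variance of total scores and number of men showing a reversal."""
    totals = [sum(row) for row in rows]
    # A row without reversals has all its passes before all its failures.
    reversals = sum(row != sorted(row, reverse=True) for row in rows)
    return pvariance(totals), reversals


careful_var, careful_reversals = summarize(simulate(error_sd=0.0))
careless_var, careless_reversals = summarize(simulate(error_sd=1.0))
print(round(careful_var, 2), careful_reversals)
print(round(careless_var, 2), careless_reversals)
```

Without error no reversals occur and the score variance is at its largest; with
error reversals appear and the variance shrinks.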

Limitations of the Ordinal Scale. The usual statistic expressing the
reliability of measurement of a test is the correlation between repeated tests or
between parallel forms of the test. If the conditions of careful, standardized
measurement did produce a perfect Guttman scale, the correlation between test and
retest, or between two separate tests having identical item characteristics, would
be unity. In a larger group of items occasional reversals would be found under
even optimum conditions of measurement and the reliability coefficient would only
approach unity. There is no essential reason, however, why the reliability of the
test of height should not be every bit as high as the reliability of the usual
measurement of height. It is all too easy, however, to be careless with any scale
of measurement, and it is probably easier to introduce error into the test than
into the use of a physical scale of measurement just because there are more
occasions with multiple items for error to occur. Without standardization of the
measurement situation, as in the second example, reliability coefficients will
depart substantially from 1.00.

There is also no essential reason why the correlation between the test of
height and the criterion measure of height should not approach unity. Lack of
uniform conditions for the test, as well as carelessness in the measurement
situation for the criterion, will attenuate the validity coefficient of the test,
but there is nothing intrinsic to an ordinal scale that produces a reduction in
validity. The only inferences barred are those involving equal units or equality
of ratios and the absolute zero. For most purposes, establishing converted scores
in a meaningful population of subjects provides useful though not full substitutes
for the standard deviation (requiring equal units) and the mean (requiring an
absolute zero) of the ratio scale.

Probably the most important type of inference barred by an ordinal scale is
the characterization of the form of the functional relationship between a
psychological disposition measured by the test, as the dependent variable, and
some independent variable. There is no point in worrying about power versus log
functions, for example, if there is doubt concerning the equality of the units of
measurement.

By making certain assumptions about the nature of human judgment it is
frequently possible to get outside the limitations of the test as here defined and
obtain equal units. Problems of scaling have been discussed thoroughly by
Torgerson (1958). For present purposes it is sufficient to add that interval and
ratio scales formed by such assumptions must be thoroughly and independently
checked. Thus, the supposedly equal interval attitude scales of Thurstone and
Chave (1929) do not have intervals that are equal independent of the attitudes of
the judges who do the scaling. The lack of equality as a function of attitude of
the judge is more marked for the equal appearing interval method of scaling than
it is for the paired comparisons method, but it is not completely absent in the
latter (see Edwards, 1957, for an extended discussion of these data). It is also
true that a Likert type scale (Likert, 1932), which is clearly ordinal in the
sense here described, is probably just as valid as a Thurstone scale (Edwards,
1957).

Determinants of Test Score Distributions. With respect to the shape of the
distribution of test scores, two generalizations are possible at this stage of
the development: (1) The variability of item difficulties is inversely related
to the variability of the distribution of scores on the test, or is directly
related to a change in the form of the distribution toward leptokurtosis. (2)
The amount of error present in the testing situation is inversely related to the
variability of the distribution of scores, or directly to a change in the form of
the distribution toward leptokurtosis.

The second generalization above appears to be at variance with classical
measurement theory. It is easy to prove in the classical theory or in the
Loevinger variant of that theory that the variance of true scores or of disposition
scores is always less than the variance of obtained scores. It is not always
recognized, however, that the preceding conclusions demand an interval scale of
measurement. For the test this means, when the same set of items is administered
once carefully and once carelessly, that two different ordinal scales are the
result. The set of items administered carefully will have the larger standard
deviation, but the ratio of the variance of true scores to error will also be
larger in that set. In contrast, when height is measured carelessly on the
physical scale, the variance of the obtained measures is larger, and the ratio of
true score variance to error variance is smaller, than when height is measured
carefully on the same scale.

Another way of illustrating these principles is to write the standard
deviation of the distribution of total scores on the test in terms of the item
characteristics. The effects of dispersion of item difficulties and the
introduction of error can be observed in the item statistics.

    \sigma_x = \sqrt{ \sum_{i=1}^{n} p_i q_i + \sum_{i \neq j} c_{ij} }

Here n is the number of items, p_i is the proportion passing item i,
q_i = 1 - p_i, and c_{ij} is the covariance between items i and j.

The largest contribution to total variance of the item variance terms is
obtained when p = q = .50 for all items. The largest contribution to the item
covariance terms, on the assumption that all covariances will be positive (all
items are assumed to be measuring the same function), is obtained when p = q for
all items. Wide variation in item difficulties reduces the contributions of item
variances and covariances. For most tests the covariance terms are more important
than the variance terms just because there are so many more of them.

The effect of increasing the amount of random error in the measurement
situation comes about by way of attenuation of the size of the covariance terms.
Error decreases the size of correlations among obtained scores relative to true
scores. The greater the amount of error, the smaller the size of the item
covariance terms, which reduces the size of the standard deviation of test scores.
The variance of a test score distribution attenuated in size by the presence of
error of measurement will contain a larger proportion of error variance relative
to true score variance than will the larger variance of an error-free test of the
same number and distribution of item difficulties. The absolute size of the
standard deviation is smaller, however, for the error-ridden test.
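
The split of total score variance into item variance terms and item covariance
terms can be checked numerically. The sketch below uses a small, invented
response matrix and hypothetical function names; it illustrates the accounting
identity only and is not data from the report.

```python
from statistics import pvariance


def covariance(x, y):
    """Population covariance of two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def variance_parts(rows):
    """Split total-score variance into item-variance and item-covariance sums."""
    items = list(zip(*rows))                   # one tuple of scores per item
    item_var = sum(covariance(col, col) for col in items)
    item_cov = sum(covariance(items[i], items[j])
                   for i in range(len(items))
                   for j in range(len(items)) if i != j)
    return item_var, item_cov


# Hypothetical responses of six people to four items (1 = pass).
rows = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0],
        [1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
item_var, item_cov = variance_parts(rows)
total_var = pvariance([sum(row) for row in rows])
print(item_var, item_cov, total_var)
```

With only four items there are already twelve covariance terms against four
variance terms, and anything that shrinks the covariances, such as random
error, shrinks the total variance with them.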

It is obvious also that the size of the standard deviation of test scores is
a direct function of the number of items. The addition of items of any type,
error-free or error-ridden, will increase the size of the overall standard
deviation. The addition of even a single item to a test changes the scale of
measurement.

The Introduction of Nonrandom Bias. In order to consider effects of nonrandom
noise or bias it will be useful to construct another physical analogue to the
test. A test constructer interested in measuring weight has a lever, a fulcrum,
and a pile of big rocks. In a pilot study he finds a place for the fulcrum that
will allow the typical rock to just about balance the typical male adult. (The
use of mean rock and mean adult has been avoided to indicate that the pilot
research does not have to be precise.) Again the rocks are each given an
identification and each man in the sample is placed on the lever opposite each of
the rocks in turn. The score is the total number of rocks raised in the air by the
passive man. The test constructor has a partially incorrect theory about what he
is trying to measure, however, and carefully instructs his subjects to lie down on
the lever with their bare feet precisely at the end and with their heads extending
toward the fulcrum as much as necessary. (He has been thoroughly indoctrinated
with the necessity for care in measurement.)

Proceeding as before, ten items are selected which produce a rectangular
distribution and a perfect Guttman scale. Scores represent an unknown mixture of
height and weight, but there are no data from the measurement operations alone that
lead to this conclusion. As long as great care is taken in measurement, no man
will fail an easier item after passing a more difficult one. The first
generalization from this example, therefore, is that systematic measurement of
nonrandom noise does not necessarily reveal its presence.

If the test constructer had instructed his subjects to lie down as described
on a specified one-half of the items and to stand with their heels at the tip of
the lever on the other half, scores would still represent an unknown mixture of
height and weight, but there is a possibility that the presence of a second factor
in the items could be detected. In this new example of measurement of systematic
bias, there are two types of items which would be expected to show their
differential similarities in their correlational patterns, which would in turn
determine two factors. Unfortunately, item correlations are affected by range of
difficulties as well as by content (see Table 2 in this regard). Differences in
item marginals may cloud the statistical differentiation between the two factors,
but there is hope in this example of being able to show the presence of the two
factors in the data. Even with careful measurement, when two factors are present
in the items, a Guttman scale will not be obtained; some subjects will fail easier
items after having passed more difficult ones.

Even if the two types of items are separated perfectly on the basis of
information internal to the measurement operation, there is no statistical clue
from the item data as to which is the better measure of weight. The proper
identification of the function each cluster is measuring might be made intuitively
from inspection of the items in the separate clusters, but external relationships
represent a more dependable means of identification. Thus the factor analytic
method is vulnerable on two counts: (1) the difficulties in factoring
dichotomously scored items and (2) adequate identification of factors without
pursuit of differential, external relationships.

If posture is allowed to vary from standing to sitting to lying down and if
this variation occurs at random from subject to subject and from item to item,
what was nonrandom noise becomes random noise. The effects of error have been
described earlier. There is, however, an in-between condition which is more
serious. If posture varies from subject to subject but not from item to item, one
subject's score may represent a relatively pure measure of weight while another's
identical score may represent a mixture of height and weight. Again, as in the
first example, there are no internal clues. A Guttman scale can be obtained under
such conditions, for example, but information external to the measurement
operation is necessary in order to identify those subjects whose measures are
valid measures of weight. Without separation of subjects this type of nonrandom
noise would depress validity coefficients. If subjects could be separated,
however, validity would be very high in one sub-group, quite low in another.

Increasing Complexity of Bias Factors. Although a number of test construction
principles have been demonstrated by means of physical analogues, up to this point
in the development there has been no precise, complete analogue for the most
important reason for the use of multiple items described earlier. Before
introducing such an analogue it will be useful to return briefly to that argument.

Tests require the subject to behave on each of a number of items. An
underlying trait or disposition to behave in certain ways is inferred from the
test score. Yet there are myriads of possible causes for behavior. Any one bit of
behavior may reflect the underlying disposition only in small degree. Knowledge
of any one word does not indicate very much about a disposition to know many words.
This phenomenon was described in terms of an analysis of the test score into
disposition, nonerror noise, and random error components. Individual test items
generally contain much more variance from nonerror noise and random error than
they do from the hypothesized disposition.

A physical analogue that will illustrate this property can again be devised.
A test constructor without a tape measure hopes to measure height. He also has no

enough to meet the criterion used in the first analogue example.
He can construct items that will measure the length of toes, fingers, lower arms,
lower legs, and head; various measures of width and depth of arms, legs, trunk,
and head are possible; various circumferences can also be measured. Such items
could be dichotomous, but might even be measured by tape or calipers on the
physical scale of measurement.

The first principle to note for this analogue is that, if dichotomous, the
items would depart radically from a Guttman scale. Many, many failures on easy
items after passing more difficult items would be evident. Long fingers do not
typically accompany a broad chest. Also, for a given number of items, the
standard deviation of the test scores would be lower than in previous examples for
items of similar difficulty levels and with equal amounts of error. The
distributions would be unimodal, even with minimum error in the measurement
operations, and with little variability of item difficulties. Systematic noise or
bias of this type, which is typical of psychological tests, reduces mean item
intercorrelations substantially. The net effect on the test score distribution is
similar to the effect of error. With sufficient item heterogeneity the test
constructer does not have to distribute item difficulties in order to have a
useful ordinal scale. Item difficulties clustered closely around .50 will produce
a U-shaped distribution with highly homogeneous items carefully measured, but the
same distribution of item difficulties with heterogeneous items carefully measured
will produce a unimodal distribution of total scores.
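
The contrast between homogeneous and heterogeneous items at a common difficulty
of .50 can be illustrated by simulation. The sketch assumes a simple model in
which each item is passed when a weighted sum of one common component and one
item-specific component exceeds zero; the specific component stands in for
systematic noise, not for careless measurement. The model, the loadings of .95
and .30, and the function name are assumptions made for illustration.

```python
import random
from collections import Counter


def score_distribution(loading, n_items=10, n_people=5000, seed=7):
    """Tally total scores when every item has difficulty .50.

    A high loading stands for homogeneous items, a low loading for
    heterogeneous items dominated by item-specific variance.
    """
    rng = random.Random(seed)
    specific_weight = (1 - loading ** 2) ** 0.5
    tally = Counter()
    for _ in range(n_people):
        common = rng.gauss(0, 1)
        score = sum(loading * common + specific_weight * rng.gauss(0, 1) > 0
                    for _ in range(n_items))
        tally[score] += 1
    return [tally[s] for s in range(n_items + 1)]


homogeneous = score_distribution(loading=0.95)
heterogeneous = score_distribution(loading=0.30)
print(homogeneous)      # counts for scores 0 through 10
print(heterogeneous)
```

Under these assumptions the homogeneous set piles subjects up at the two
extreme scores, while the heterogeneous set concentrates them near the middle.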

Criteria for Item Selection. If there were an objective external criterion
of height available, (a) it would be simple to obtain multiple regression weights
for the set of items, (b) but under these circumstances it would be unnecessary
to measure height with a set of items. The analogue to the test is to combine
these items selectively in a linear fashion on the basis of internal data alone.

An equally weighted linear combination of all of the items listed would
undoubtedly produce a score that would be rather highly correlated with height.
This score would probably be more highly correlated with height than would any one
of the components, but some sub-set of these items might produce an even higher
correlation. Clustering or factoring might be possible, subject to the
reservations expressed earlier about the effects of disparate item difficulties,
but this approach is far from simple. Depending upon the density of sampling of
bodily measures one might find both finger and toe length factors, a long bone
factor, a body width factor or factors, circumference factors, etc. If only one
factor is to be used, it is probable that the long bone one is most highly
correlated with height, but the test constructor working without this knowledge
would have difficulty justifying this selection. From item data alone his grounds
could only be intuitive, following inspection of the items, and the items
obviously refer to bone length and not to full stature. Furthermore, it is highly
probable that the entire set of items carries more information about height than
does the long bone subset, so the problem of the test constructer is to bring in
all of the useful information and exclude the useless information from his test.

This situation can also be viewed in terms of the factor analytic methods
with particular reference to the problem of factoring in several orders. With
the use of very large numbers of items as contemplated in this test of height,
there would probably be several first order factors defined by anatomical location
and by the dimension, i.e., length, breadth, or depth, measured. These factors
would contain relatively little information concerning full stature; instead they
represent mainly systematic noise. Factors more closely related to full stature
would be found in higher orders. With many width and breadth measures along with
the measures of length, the factor in the highest order would probably represent

nal functional relationships. In general, however, desired dispositions are not
first order factors.

The test constructer can proceed with simpler statistical methods than factor
analysis, such as item-total score correlations (internal consistency item
analysis), but there are several problems here. An important one concerns the
original item pool. If it contains substantial numbers of width and depth
measures, item selection by means of the total score correlations will lead to a
measure of body volume rather than of stature. Secondly, there are no criteria
for deciding at what point to exclude an item from a test. If the original item
pool is approximately correct, but the homogeneity standards are set too high,
useful items will be excluded; if set too low, items not contributing to the
measurement of height, as distinguished from the correlated measures of weight,
will be included.
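
Internal consistency item analysis of the kind just mentioned can be sketched as
follows. The variant shown correlates each item with the total of the remaining
items, which keeps an item from correlating with itself; that choice, the
function names, and the invented response matrix are assumptions of this
sketch rather than details given in the report.

```python
def pearson(x, y):
    """Product-moment correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_rest_correlations(rows):
    """Correlate each item with the total score on the remaining items."""
    result = []
    for k in range(len(rows[0])):
        item = [row[k] for row in rows]
        rest = [sum(row) - row[k] for row in rows]
        result.append(pearson(item, rest))
    return result


# Hypothetical data: items 0-2 hang together, item 3 runs against them.
rows = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 0, 1],
        [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
correlations = item_rest_correlations(rows)
for k, r in enumerate(correlations):
    print(k, round(r, 2))
```

The statistic says only how well an item agrees with the rest of the pool. It
carries no information about whether the pool as a whole measures stature or
body volume, and it supplies no cutoff for exclusion, which are the two
problems raised above.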

Test constructors frequently use the steepness of age or grade curves for
items administered to children as a criterion of item selection, but there are
many functions that increase with increasing age. Such age curves have been
commonly used in the development of tests of intelligence, but their use has again
been dependent on the original item pool. Items from the pool not showing the
expected relationship with age are discarded objectively, but many, many items
that would have shown the expected relationships with age do not appear in the
pool. For intelligence tests the choice of items for the pool has been based
upon theory, tradition, and availability--and not necessarily in that order of
importance.

Without recourse to an external physical measure of height the test
constructer can only make use of the network of functional relationships involving
his test with measures of other functions as a check on his item selection. The

rations for items of the functional relationships involving total scores are

selection is a slow, arduous, and ambiguous task.



Summary of Multiple Item Function. It is hoped that the function of multiple
items, as well as the difficulties inherent in their use, has become clear. The
constructer of a psychological test has no physical measure to use as a criterion.
He measures bits of behavior (items), each of which reflects the underlying
disposition in which he is interested only to small degree. By adding the right
items together he can build up the variance of the underlying disposition in the
total score, but he needs to reduce random error and to spread nonrandom noise as
much as possible. In the present physical analogue the test constructer is not
interested in measuring finger or toe length as such. Instead his interest lies in
their ability to give him a modicum of information about stature. If he includes
too many measures of finger and toe length in his stature score, there will be too
much variance present from factors in which he is not interested. Such bias may
be present in sufficient amount to mask the information about stature. By bringing
in as many indicants of stature as possible, and varying their distribution over
the body as much as possible, the nonrandom noise while still present is minimized,
or spread over so many functions or factors other than height that the total score
reflects height primarily.
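
The way a sum of weak indicants builds up the variance of the underlying
disposition can be illustrated by simulation. The sketch assumes twelve
hypothetical bodily measures, each equal to 0.4 times stature plus an
independent standard normal component standing for item-specific noise and
error combined. The weight, the number of measures, and the names are
assumptions made for illustration.

```python
import random


def pearson(x, y):
    """Product-moment correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_body_measures(n_people=2000, n_items=12, seed=3):
    """Return stature and a list of weakly related measures, one per item."""
    rng = random.Random(seed)
    stature, items = [], [[] for _ in range(n_items)]
    for _ in range(n_people):
        h = rng.gauss(0, 1)
        stature.append(h)
        for k in range(n_items):
            # Each measure reflects stature only to a small degree.
            items[k].append(0.4 * h + rng.gauss(0, 1))
    return stature, items


stature, items = simulate_body_measures()
composite = [sum(values) for values in zip(*items)]
best_single = max(pearson(stature, item) for item in items)
composite_r = pearson(stature, composite)
print(round(best_single, 2), round(composite_r, 2))
```

Because the item-specific components are independent in this sketch, they
largely cancel in the sum while the stature component accumulates. If several
measures shared a specific component, as many finger and toe lengths would, the
cancellation would be less complete, which is the point made above about
spreading nonrandom noise over the body.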

For a psychological function such as verbal comprehension used in an earlier
illustration, the availability of a population of words from which to sample
randomly, and in sufficient number, is a very important way in which to define the
central function or factor that one wishes to maximize in the total score. However,
item populations from which to sample are relatively rare; when items are invented,
the defining of a population is at best arbitrary and at worst impossible. To
distinguish between the nonrandom variance of the disposition and of noise involves
a long term research operation. The most hopeful procedure if one is restricted

in orders beyond the first. Much more often than not, first order factors
represent systematic noise rather than the disposition in which the test
constructor is interested, while it is only in the second or higher order that he
finds the construct he is seeking to measure.

The above reasoning represents a complete break with a tradition of test
construction in psychology in which high item homogeneity has been an important
goal. The traditional reasoning has been that, with high homogeneity, one could
infer that the test was measuring a unique, unitary function or factor. With
sufficiently high homogeneity, the test becomes a Guttman scale though this is
rarely attained. Scalability of a universe of items, nevertheless, became a goal
toward which to strive.

This tradition leaves undefined the question as to how high the degree of
homogeneity should be. This represents a formal objection to the homogeneity
model, but the primary objection represented in this discussion involves the
nature of test items. Items necessarily involve several kinds of components.
For certain purposes one of these can be labelled the primary disposition while
others are necessarily nonrandom noise and random error. Guttman scales are
unobtainable except under highly restrictive, even artificial, situations.

Radical solutions should always be entertained, including the possibility
that separate Guttman scales should be constructed for each source of nonrandom
noise, while trying to hold constant all other sources of bias. What has been
noise, in other words, becomes many tests. The feasibility of this solution is in
serious doubt, however, in terms of the sheer number of tests that would result.
Any estimate must be labelled a guess, but it is a guess conditioned by test
construction experience of psychometricians generally as well as by psychological
assumptions concerning the number of possible sources, or causes, of responses to

homogeneous typically have mean levels of intercorrelations less than .20. In
attitude measurement approximations to Guttman scales can be obtained with a few
items having very diverse item popularities (defined statistically in a fashion
similar to item difficulty) in which essentially the same question is asked with
minor variations in wording. Correlations between attitude items and logical
reversals of those items are not high. All in all, a guess that tens of thousands
of tests would be the result is not out of line. It seems better to retain the
concept of nonerror noise and to allow test constructors freedom to broaden or
narrow it at will, and in accordance with scientific convenience, rather than to
impose the goal of high homogeneity on all testa. 
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l 

a + 

b + 

c + 

d 4* 

jrnons e + 

f + 

8 + 

h 4- 

i + 

J 0 



TABLE 1 

Score Matrix Which Producer a Perfect 
Guttman Scale 



2 

+ 

4* 

+ 

+ 

+ 

+ 

+ 

0 

0 



3 

+ 

4- 

+ 

+ 

4- 

+ 

+ 

0 

0 

0 



4 

+ 

4* 

4- 

4* 

+ 

+ 

0 

0 

0 

0 



Items 
5 6 



+ 

4- 

+ 

+ 

+ 

0 

0 

0 

0 

0 



+ 

4* 

4- 

4* 

0 

0 

0 

0 

0 

0 



7 

4* 

4* 

4- 

0 

0 

0 

0 

0 

c 

0 



8 

4* 

4- 

0 

0 

0 

0 

0 

0 

0 

0 



9 

4- 

0 

0 

0 

0 

0 

0 

0 

0 

0 



Test Score 
9 
8 
7 
6 
5 
4 
3 
2 
I. 

0 



i)f ( [ icul ty 
Level 



.8 .7 .6 .5 ,4 



.51 .2 



.1 
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TABLE 2 



Intercorr.letlons of the Item. tn a Perfect 
Gut t man Scale 





L 2 


3 


4 


5 


6 


7 


8 


9 


1 


.67 


.31 


.41 


.33 


.27 


.22 


.17 


.11 


2 




.77 


.61 


.50 


.41 


.33 


.25 


.17 


3 






.80 


.66 


.54 


.43 


.33 


.22 


4 








.82 


.67 


.54 


.41 


.27 


5 










.82 


.66 


.50 


.33 


6 












.80 


.61 


.41 



7 .77 .51 

0 .67 

9 
TABLE 3

Score Matrix (Schematic) in Which Measurement Error
Has Been Introduced on Potentially Scalable Items

                              Items
Persons       1    2    3    4    5    6    7    8    9     Test Score

   a          +    +    +    +    +    0    +    0    +         7
   b          +    +    +    +    0    0    +    +    0         6
   c          +    +    +    0    +    +    0    0    0         5
   d          +    +    +    0    0    +    0    +    0         5
   e          +    +    0    +    +    0    +    0    0         5
   f          +    0    0    +    +    +    0    0    0         4
   g          +    0    +    +    +    0    0    0    0         4
   h          +    +    +    0    0    +    0    0    0         4
   i          +    +    0    +    0    0    0    0    0         3
   j          0    +    +    0    0    0    0    0    0         2

Difficulty
Level        .9   .8   .7   .6   .5   .4   .3   .2   .1
Abstract

A methodology has been described and illustrated for obtaining an evaluation
of the importance of the factors in a particular order of factoring that does not
require factoring beyond that order. For example, one can estimate the
intercorrelations of the original measures with the perturbations of the
first-order factors held constant or, the reverse, estimate the contribution to
the intercorrelations of the original measures from the first-order factors
alone. Similar operations are possible at higher orders.
EVALUATING THE IMPORTANCE OF FACTORS IN ANY GIVEN ORDER OF FACTORING



Lloyd G. Humphreys, Ledyard R. Tucker, and Peter Dachler



University of Illinois, Urbana 



One of us (Humphreys, 1962) has recommended hierarchical factoring of measures 
of human abilities for reasons connected with a presumed gradient of importance of 
factors in the several orders. One indication of importance is predictive 
validity. Broad tests have generally higher predictive validities than narrow 
tests. It is also very difficult empirically to find stable differential weights 
for a variety of criteria for very narrow tests. 

Valid objections can be raised to the evaluation of the importance of factors 
based upon correlations with outside criteria, but by factoring in several orders 
and using the Schmid-Leiman transformation (1957) to obtain a hierarchical 
orthogonal factor matrix an internal criterion can be obtained. The contributions 
to common factor variance of the several factors can be computed and compared. It 
occurred to us, however, that an internal criterion that did not involve higher
order factoring would be useful. Such a criterion is readily available. 



Mathematical Development

We shall let R stand for any matrix of intercorrelations. The subscripts 0,
1, 2, etc. will designate original intercorrelations, intercorrelations of first-
order factors, intercorrelations of second-order factors, etc. Matrices of rotated
factor loadings (Harman's pattern matrices on the primary axes) will be symbolized
by P, also with appropriate subscripts. Estimated matrices are designated by the
circumflex. Thus, we have the following well known relationships:

    $\hat{R}_0 = P_1 R_1 P_1'$                                    (1a)

    $\hat{R}_1 = P_2 R_2 P_2'$                                    (1b)

An evaluation of the importance of the factors in a given order is obtained
by replacing the unities in the diagonal of R with the estimated communalities of
the factors and multiplying as before. Replacing the unities in this way amounts
to subtracting $U^2$, the diagonal matrix of unique variances of the factors
(unity minus the estimated communality). We symbolize the new matrix, which
represents the estimated intercorrelations at a lower order with the perturbations
of the factors at the next higher order removed, in a fashion analogous to partial
correlations. Thus we have:

    $\hat{R}_{0.1} = P_1 (R_1 - U_1^2) P_1'$                      (2a)

    $\hat{R}_{1.2} = P_2 (R_2 - U_2^2) P_2'$                      (2b)

Other matrices of interest can be derived immediately from the above. The
direct contribution of the first-order factors, in contrast to the control of their
effects, is given by the following:

    $\hat{R}_0 - \hat{R}_{0.1} = P_1 U_1^2 P_1'$                  (3a)

    $\hat{R}_1 - \hat{R}_{1.2} = P_2 U_2^2 P_2'$                  (3b)

These direct contributions of the factors can be designated as
$\hat{R}_{0.2,3 \ldots k}$ and $\hat{R}_{1.3 \ldots k}$, also in a fashion
analogous to partial correlations, to indicate that the effects of higher order
factors have been removed. Thus the entries in the matrix
$\hat{R}_{0.2,3 \ldots k}$ indicate the contributions of the first order factors
only to the correlations among the original measures. It also follows that
$\hat{R}_{0.1,2,3 \ldots k} = U_0^2$, which represents another example of the
symbolism.

It has been suggested that low $U^2$ values for factors may be obtained when
there is substantial capitalization upon chance (Horn, 1966), or when rotations
have been contrived to force data into a particular structure (Humphreys, 1967).
Low values of $U^2$ can also be obtained legitimately and objectively from the
nature of the data. Whatever the basis may be, if the matrices of contribution to
correlations obtained in the fashion of Formula 3a contain values close to zero,
the first order factors in question can be considered relatively unimportant. A
factor may make only a minor contribution to covariation. A similar statement may
be made for $\hat{R}_{1.3 \ldots k}$ and second order factors.

It is possible to estimate communalities for factors in the several ways that
one estimates communalities for the original measures, but under certain conditions
squared multiples have much to recommend them. Multiple correlations, depending
as they do on the entire matrix of intercorrelations, are more stable than many
other communality estimates. Although squared multiples are lower bound estimates
only in the population of observations and approach "true" communalities only as
the population of measures is approached, they tend not to be seriously in error
when the number of measures, or factors as in the present case, is moderately large
and when the number of observations is much larger.

When the use of squared multiples is appropriate, it is unnecessary to factor
in a higher order in order to use formulas 2 and 3 in evaluating factor importance.
Thus the error of estimate variance, symbolized as $S^2$, can be substituted for
$U^2$ in formulas 2 and 3.

    $\hat{R}_{0.1} = P_1 (R_1 - S_1^2) P_1'$                      (4)

    $\hat{R}_{0.2,3 \ldots k} = P_1 S_1^2 P_1'$                   (5)

We can now, in turn, let V = PS, which leads to the following relationship:

    $\hat{R}_{0.2,3 \ldots k} = V_1 V_1'$                         (6)

The matrix V contains the projection of the measures on the normals to the
hyperplanes (Harman's reference vector structure). Consequently, these projections
can also be used to evaluate the contributions of the factors. Small correlations
with the reference vectors of measures of high communality indicate that the
important factors are in a higher order.
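
The arithmetic of formulas 4 through 6 can be sketched as follows. This is a
minimal illustration in plain Python, not the authors' computing procedure. The
pattern matrix `P1` and factor intercorrelations `R1` are hypothetical values
chosen for the example, and all function names are assumptions. Two factors are
far too few for squared multiples to be appropriate in practice; the small size
only keeps the arithmetic easy to follow. The sketch relies on the standard
identity that the error of estimate variance of a variable regressed on the
others equals the reciprocal of the corresponding diagonal element of the
inverse correlation matrix.

```python
from math import sqrt

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting for a small nonsingular matrix."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def error_variances(r):
    """S^2 per factor: 1 minus the squared multiple correlation = 1 / diag(R^-1)."""
    inv = inverse(r)
    return [1.0 / inv[i][i] for i in range(len(r))]

def contributions(p, r1):
    """Return (higher-order part, first-order part, V) for pattern p and factor R."""
    s2 = error_variances(r1)
    k = len(r1)
    reduced = [[r1[i][j] - (s2[i] if i == j else 0.0) for j in range(k)]
               for i in range(k)]
    higher = matmul(matmul(p, reduced), transpose(p))             # formula 4
    v = [[row[j] * sqrt(s2[j]) for j in range(k)] for row in p]   # V = PS
    first = matmul(v, transpose(v))                               # formula 6
    return higher, first, v

# Hypothetical example: four measures, two correlated first-order factors.
P1 = [[0.8, 0.0], [0.7, 0.1], [0.0, 0.6], [0.1, 0.7]]
R1 = [[1.0, 0.5], [0.5, 1.0]]
higher, first, V = contributions(P1, R1)
total = matmul(matmul(P1, R1), transpose(P1))                     # formula 1a
```

In this example `higher` and `first` sum, cell by cell, to `total`, which is the
relationship expressed by formula 3a with $S^2$ in place of $U^2$.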

Illustrations of the Procedures 

In order to illustrate these procedures we turned to published data. The
Adkins and Lyerly analysis of reasoning tests (1952) contains sufficient numbers
of factors in the first order to allow use of squared multiples as communality
estimates. The same is true of the first-order factors in Cattell's study of fluid
and crystallized ability (1963).

It is not feasible to present large estimated intercorrelational matrices.
Other indices must be found. Distributions of the diagonals of the several
matrices constitute one compact way of describing the contributions of first order
and higher order factors. Intercorrelations of selected variables can also be
shown as more concrete illustrations of the effects of first and higher order
factors.

Table 1 contains the distributions described above along with means and
standard deviations. The first order factors in the Adkins and Lyerly data are
responsible for a higher percentage of the variance than the first order factors
in the Cattell data. It should also be borne in mind in interpreting these
results that the available communality would ordinarily be spread over many more
first-order factors than all higher order factors combined. Furthermore, several
of the Cattell first-order factors are specifics so that diagonals in
$\hat{R}_{0.2,3 \ldots k}$ are, in a sense, inflated.

The lesser importance of Cattell's first-order factors, it should also be
noted, is not critical with respect to his conclusions. His study was designed
for higher order factoring. In other studies, however, in which the first-order
factors are of prime importance to the investigator, the technique here being
illustrated is a desirable, even necessary, check on the conclusions reached.

In both of the two sets of data the correlation between the two estimates of
the diagonal is only moderately negative. This means that the effects of
controlling first-order factors or higher order factors are far from homogeneous
with respect to the measures. Subsets of measures are differentially affected by
factors in the several orders. Table 2, for example, contains estimated and
obtained intercorrelations of selected verbal tests from Adkins and Lyerly.
Obtained correlations are below the diagonal while the estimated contributions to
the correlations of first-order factors alone and higher order factors alone are
above the diagonal. Within each cell above the diagonal the upper value
represents $\hat{R}_{0.2,3 \ldots k}$; the lower value, $\hat{R}_{0.1}$. Thus we
see that the intercorrelations of the verbal tests in this analysis tend to be
explained more by higher order factors than by first-order factors.

Table 3 contains similar data for the Primary Mental Ability measures used by
Cattell. First-order factors contribute only to the correlations between the
parallel forms while the higher order factors account, as one would expect, for the
intercorrelations of the different tests. The Fluency test for which no parallel
form was available defined a specific factor in the Cattell analysis. This is
clearly seen in the correlations presented. Furthermore, 39 of the possible 40
correlations with other variables in the full $\hat{R}_{0.2,3 \ldots k}$ matrix
are smaller than .10 for the Fluency measure, and the fortieth is less than .20.
Summary and Conclusions

A methodology has been described and illustrated for obtaining an evaluation
of the importance of the factors in a particular order of factoring that does not
require factoring beyond that order. For example, one can estimate the
intercorrelations of the original measures with the perturbations of the
first-order factors held constant or, the reverse, estimate the contribution to
the intercorrelations of the original measures from the first-order factors
alone. Similar operations are possible at higher orders.

An estimate of communality of the factors at a given level is required in
order to estimate correlations at the lower level. When many factors are involved, 
squared multiples can be used for this purpose. Under these circumstances, also, 
the importance of the factors can be gauged by the size of the correlations of the 
original measures, or factors, with the reference vector structure for those 
measures or factors. 
Table 1

Distributions of Diagonal Values in $\hat{R}_{0.1}$ and $\hat{R}_{0.2,3 \ldots k}$
from Two Separate Analyses

                   Adkins and Lyerly (1952)                         Cattell (1963)

Interval    $\hat{R}_{0.1}$    $\hat{R}_{0.2,3 \ldots k}$    $\hat{R}_{0.1}$    $\hat{R}_{0.2,3 \ldots k}$

      65                                   1
      60                                   1                                                1
      55                                   1                                                0
      50           1                       2                                                1
      45           1                       6                                                5
      40           3                      13                        2                       4
      35           7                      14                        4                       5
      30           9                      16                        8                       6
      25           7                       5                       15                       9
      20          13                       5                        7                       7
      15           9                       2                        2                       3
      10           5                                                3
      05           3
      00           5
     -05           2
     -10           0
     -15           1

    Mean         .22                     .37                      .28                     .33
    S.D.         .13                     .10                      .10                     .11

Table 2

Comparison of Obtained and Estimated Intercorrelations
of Selected Verbal Tests from Adkins and Lyerly (1952)*

                                49    50    59    60    61    62

49 Reading 1                          30    26    21    21    44
                                      32    32    38    36    29

50 Reading 2                    65          21    21    23    27
                                            36    38    37    33

59 Verbal Analogies             57    54          17    20    28
                                                  41    41    34

60 Verbal Classification 1      58    59    58          30    21
                                                        48    41

61 Verbal Classification 2      56    60    61    87          21
                                                              40

62 Vocabulary                   74    65    65    63    62

* Measures are numbered as in Adkins and Lyerly. Entries below the
diagonal are observed intercorrelations; the upper one of the pair of
entries above the diagonal is from $\hat{R}_{0.2,3 \ldots k}$ (first order
factor contributions), and the lower one is from $\hat{R}_{0.1}$ (higher
order factor contributions).
Table 3

Intercorrelations of PMA Measures
from Cattell (1963)*

                     1     2     3     4     5     6     7     8     9

1 Verbal 1                48    04    05    00    02   -02   -02    08
                          38    27    26    43    39    36    34    36

2 Verbal 2          86          00    00    00    02    00   -02    09
                                29    28    43    40    39    36    37

3 Space 1           30    30          53   -02    00   -04    01    05
                                      26    24    24    19    13    20

4 Space 2           32    27    79         -03    03   -04    01    05
                                            25    23    20    14    16

5 Reasoning 1       41    42    21    23          40    02    04    03
                                                  38    41    41    34

6 Reasoning 2       42    41    25    25    77         -02    01    05
                                                        39    38    30

7 Numerical 1       34    37    23    16    40    37          44    04
                                                              33    39

8 Numerical 2       32    33    19    14    43    39    78          04
                                                                    36

9 Fluency           44    45    17    22    36    33    42    40

* Entries below the diagonal are observed intercorrelations; the upper
one of the pair of entries above the diagonal is from
$\hat{R}_{0.2,3 \ldots k}$ (first order factor contributions), and the lower
one is from $\hat{R}_{0.1}$ (higher order factor contributions).
Abstract

A major conclusion of the 1947 Scottish survey of intelligence was
that there had been a gain since 1932 on the group test but no gain
on the individual Stanford-Binet test. This conclusion is marred,
however, by the use of regression methods of equating the 1916 and 1937
editions of the individual test for which only 89 cases were available.
Avoidance of the sample of 89 cases who had been administered both
editions of the individual test by the use of the equipercentile method
of equation reveals parallel gains for the group and the individual
test. There is no need to qualify the conclusion that a small
increase in intelligence among Scottish school children occurred between
1933 and 1947.
Footnote to the Scottish Survey of Intelligence
Lloyd G. Humphreys
University of Illinois

The results of the 1947 Scottish survey of intelligence (Scottish Council
for Research in Education, 1949) were somewhat ambiguous, as reported, with
respect to a gain in intelligence between 1932 and 1947. The group test, which
was the principal survey instrument, showed a gain. The 1916 and the 1937 editions
of the Stanford-Binet were also administered in 1933 and 1947, respectively, to
500 or more students of each sex. When the two editions were equated, by a
procedure which will be described briefly below, the authors reported a slight
loss for girls. Overall the mean intelligence quotients were almost precisely
the same. The conflict in results between the group test and the individual tests
has sometimes been interpreted as meaning that "real" intelligence did not change.

The procedures used in equating the two versions of the individual test are, 
however, open to question. First, the authors used a regression method for 
equating group and individual test scores separately for the two sexes. Then they 
used a regression method for equating the scores on the separate individual tests. 
The equation of group to individual test involved 500 or more of the special cases 
for each sex, but the final equation of the two individual tests involved only 89 
cases of both sexes combined. 

A good case can be made against the regression method of equating scores on
two tests each of which is supposedly measuring basically the same function. An
even better case can be made for forgetting about the 89 cases who had been given
both versions of the individual test and basing the equation of the two on their
relationship to the group test. An N of 1000 or more for this step in the
procedure has clear advantages over an N of 89, but use of the large N requires
the equipercentile method of conversion.



There are advantages to the use of the equipercentile method over and beyond
its applicability to the data based upon the large N. It requires only ordinal
scales of measurement and is thus independent of the shape of the regression. It
is also independent of attenuation in the slope of the regression introduced by
measurement error. It does require the assumption that the two measures are
equally valid measures of the trait, but the regression method requires a different
and at least equally difficult assumption that one of the two measures can be
considered the criterion measure of the trait.
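
The idea of an equipercentile conversion can be sketched in a few lines. This is
a simplified illustration under stated assumptions, not the procedure used for
the tables in this paper: it equates two hypothetical score samples directly,
defines percentile rank at the midpoint of tied scores, and interpolates linearly
between observed score points. The function names and the sample values are
invented for the example.

```python
def percentile_rank(sample, x):
    """Proportion of the sample below x plus half the proportion equal to x."""
    below = sum(1 for s in sample if s < x)
    equal = sum(1 for s in sample if s == x)
    return (below + 0.5 * equal) / len(sample)

def score_at_rank(sample, rank):
    """Score in `sample` whose percentile rank is `rank`, by linear interpolation."""
    points = sorted(set(sample))
    ranks = [percentile_rank(sample, p) for p in points]
    if rank <= ranks[0]:
        return points[0]
    if rank >= ranks[-1]:
        return points[-1]
    for i in range(len(points) - 1):
        r0, r1 = ranks[i], ranks[i + 1]
        if r0 <= rank <= r1:
            return points[i] + (points[i + 1] - points[i]) * (rank - r0) / (r1 - r0)

def equate(x, from_sample, to_sample):
    """Convert score x on one test to the score with equal percentile rank on another."""
    return score_at_rank(to_sample, percentile_rank(from_sample, x))

# Hypothetical samples: the second test is a nonlinear rescaling of the first.
test_a = [10, 20, 30, 40, 50]
test_b = [100, 400, 900, 1600, 2500]
converted = equate(30, test_a, test_b)
```

Because only the rank order of scores enters the computation, the conversion is
unaffected by the shape of the relationship between the two scales, which is the
property noted in the paragraph above.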

Table 1 contains comparable scores for boys and girls for each level of the
group test on the 1916 and 1937 editions of the Stanford-Binet. Also included
are the differences between the intelligence quotients for each level of the group
test. When these differences are weighted by their respective Ns and averaged,
it is seen that the mean difference in intelligence quotients for boys between
the 1916 and 1937 editions is 3.14 units. The comparable figure for girls is
-.99. There is clearly an interaction between sex and the two editions of the
Stanford-Binet when the conversion is based upon the group test common to both
testing periods. This same interaction is also seen in the results from the 6-day
sample in which girls are slightly superior to boys on the group test and
significantly inferior to boys on the 1937 revision. For some reason, Scottish
girls seemed to be at a relative disadvantage on the revised Stanford-Binet in
spite of the near equality of the sexes in the standardization samples. The
latter were, of course, drawn entirely from the United States.
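
The weighting step described above is a simple N-weighted average. The sketch
below shows the arithmetic only. The differences and Ns in it are hypothetical
values chosen for illustration, since the Ns at each level of the group test are
not reproduced in this paper, and the function name is an assumption.

```python
def weighted_mean(values, weights):
    """Average of `values` with each value weighted by its frequency."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical differences at three score levels and the number of cases at each.
differences = [4.0, 2.0, -1.0]
ns = [10, 30, 60]
mean_difference = weighted_mean(differences, ns)
```

Levels with few cases, such as the extremes of the distribution, thus contribute
little to the mean difference even when their individual differences are large.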

There are two ways in which the equipercentile method can be applied. Equi- 
percentile conversions can be computed between the group test and each individual 
test, and then the individual tests can be converted to each other. Alternatively, 
the published regression conversions of individual test on group test can be
accepted, but an equipercentile conversion can be substituted for the published
regression conversion between the two individual tests. The latter will be 
designated the "mixed" method. 

Table 2 presents the published data on group test and the regression
conversions of the individual tests in the first three numbered rows. These data
are followed by equipercentile conversions computed by the present writer. The
means in lines 2 and 4 which represent different estimates of the population
I.Q.s on the individual tests are generally comparable though the equipercentile
values are somewhat lower. This difference is a regression phenomenon, arising
from the lack of perfect correlation between group and individual test.
Conversions of 1916 I.Q.s into 1937 I.Q.s are contained in lines 3, 5, and 7. The
two variations of the equipercentile method are in general agreement in showing a
gain in I.Q. for girls, but both depart radically from the regression results.
Furthermore, when the 1932 results are presented in the units of the 1937
revision of the Stanford-Binet (lines 6 and 8), the discrepancy between the
results for the two sexes is very marked.

The weak conclusion that can be drawn from this analysis is that the previous 
outcome of no gain on an individual test of intelligence between 1932 and 1947 is 
questionable since a different and supportable methodology demonstrates a gain. 

The strong conclusion, which accepts the superiority of the equipercentile 
methodology for problems of this type, is that gains on group and individual tests 
are approximately parallel and that the gain was greater for girls than for boys. 
From the latter point of view there is no need to qualify the conclusion that a 
small increase in intelligence among Scottish school children occurred between 
1933 and 1947. One is strongly tempted, furthermore, to predict that the gain
would have been larger without the disrupting effects on education of World War II.
Even so the results are well in line with Tuddenham's demonstration of an increase
in intelligence among men in the United States between World War I and World War
II (1948).







Table 1

Equivalent Scores on the 1916 and 1937 Editions of
the Stanford-Binet for Various Levels of the Group Test

                            Boys                                Girls
Group test    1937 ed.   1916 ed.   Difference    1937 ed.   1916 ed.   Difference

   69.5        167.83     149.17      18.66        154.50     149.50       5.00
   64.5        151.17     140.12      11.05        149.50     139.50      10.00
   59.5        134.50     128.46       6.04        133.71     132.50       1.21
   54.5        125.02     121.53       3.69        122.68     119.71       2.97
   49.5        116.06     112.94       3.12        113.72     112.36       1.38
   44.5        108.46     106.37       2.09        107.10     107.05        .05
   39.5        103.96     100.48       3.48         99.27     100.01       -.74
   34.5        100.11      95.33       4.78         94.01      94.21       -.20
   29.5         95.77      90.88       4.89         88.43      90.22      -1.79
   24.5         89.85      87.38       2.47         82.68      86.17      -3.49
   19.5         85.96      85.26        .70         79.27      82.28      -3.01
   14.5         82.09      81.57        .52         75.52      79.05      -3.53
    9.5         77.71      77.71        .00         72.17      75.41      -3.24
    4.5         73.79      71.64       2.15         67.23      70.05      -2.82
    -.5         54.50      54.50        .00         49.50      64.50     -15.00

Weighted Mean Difference               3.14                                -.99
Table 2

Summary of Results for 1932 and 1947

                                          Boys                  Girls
                                     1932       1947       1932       1947

(1)  Group Test                     34.503     35.880     34.409     37.622

     Regression Conversions
(2)    Individual on Group           99.86     103.68      98.56     100.75
(3)    1916 on 1937                            100.48                 97.89

     Equipercentile Conversions
(4)    Group to Individual           98.29     103.52      97.67     100.61
(5)    1937 to 1916                             99.91                101.49
(6)    1916 to 1937                 101.70                 96.33

     Mixed Conversions*
(7)    1937 to 1916                            100.54                101.74
(8)    1916 to 1937                 103.00                 97.57

* Utilizes the regression conversion of each individual test on the group test
(line 2) as the first step, but uses an equipercentile conversion as the second
step.